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This  thesis  considers  the  problem  of  statistically  evaluating  the  closeness  of 
multiple  genes  based  on  their  DNA  intronic  regions. 

The  purpose  of  intronic  regions  in  a  gene  is  unknown.  Recent  research  suggests 
that,  in  cancer  related  genes,  intronic  regions  may  play  a  role  in  regulating  disease 
susceptibility.  To  investigate  whether  the  intronic  features  of  cancer  related  genes  differ 
from  non-regulatory  genes,  a  collection  of  oncogenes,  tumor  suppressor  genes,  and  non- 
regulatory  genes  involved  in  enzyme  metabolism  are  analyzed.  A  statistical  methodology 
is  employed  to  determine  whether  features  of  these  genes’  intronic  regions  will  result  in 
clustering  by  regulatory  group. 

This  is  the  first  comparison  of  intronic  attributes  for  33  homo  sapien  genes. 
Attributes  analyzed  include  mean  of  intronic  log  lengths,  standard  deviations  of  intronic 
log  lengths  and  number  of  intronic  regions  in  a  gene.  The  analysis  also  includes  the 
available  ordered  DNA  data  from  these  intronic  regions.  The  process  begins  with  creation 
of  first-order  Markov  transition  counts  for  all  intronic  regions  in  each  gene.  A  hypothesis 
testing  approach  employs  these  counts  to  determine  whether  the  order  of  the  nucleotides 
in  each  intronic  region  is  random  or  whether  this  order  shows  statistical  evidence  of  a 


first-order  Markov  chain. 


A  test  of  homology  on  the  Markov  transition  counts  from  comparable  intronic 
regions  of  two  genes  is  performed.  The  average  of  the  Chi-square  test  statistics  for  all 
intronic  pairs  between  two  genes  creates  a  measure  of  homology  that  depicts  the  overall 
closeness  between  each  gene  pair.  Results  for  all  gene  pairs  are  combined  into  a  distance 
matrix,  demonstrating  that  results  of  a  statistical  hypothesis  test  can  be  used  as  a  distance 
metric.  A  hierarchical  cluster  analysis  is  performed  based  on  this  distance  matrix. 

Several  significant  biological  results  were  observed.  Sixty-seven  percent  of 
intronic  regions  in  the  data  set  were  first-order  Markov  processes,  providing  evidence  that 
not  all  intronic  regions  are  random  DNA  sequences.  Tumor  suppressors  clustered  based 
on  homology  results.  Additionally,  oncogenes  and  non-regulatory  genes  clustered  based 
on  number  of  introns,  and  based  on  a  combination  of  number  of  introns  and  means  of  the 
intronic  log  lengths  in  a  gene. 
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CHAPTER  I 


INTRODUCTION 

Purpose  of  the  Study 

The  primary  goal  of  this  thesis  is  the  development  of  a  statistical  methodology  to 
describe,  quantify  and  compare  currently  known  and  available  intronic  regions  from 
oncogenes,  tumor  suppressor  genes,  and  non-regulatory  genes  found  in  the  process  of 
enzyme  metabolism.  A  test  of  homology  on  Markov  transition  coimts  from  comparable 
intronic  regions  is  performed  on  all  possible  pairs  of  genes.  The  average  of  the  Chi- 
square  test  statistics  for  all  intronic  pairs  between  each  gene  pair  creates  a  measure  of 
homology  that  depicts  the  overall  closeness  between  the  gene  pair.  On  a  more  global 
scale,  the  issue  of  whether  intronic  regions  of  regulatory  and  non-regulatory  genes  cluster 
based  on  degree  of  regulatory  function  is  investigated.  Hierarchical  cluster  analysis  is 
used  throughout  this  thesis  to  determine  gene  groupings. 

Chapter  II  details  previous  biological,  statistical  and  computer  contributions  to  this 
research  area.  Chapter  III  describes  the  data  collection  process  and  provides  basic 
information  about  the  genes  in  the  data  set. 

Chapter  IV  contains  the  first  published  comparison  of  intronic  attributes  for  the  homo 
sapien  genes  examined  in  the  data  set.  Available  intronic  features  include:  number  of 
introns  in  a  gene,  mean  of  the  intronic  log  lengths  in  a  gene,  and  the  standard  deviation  of 
intronic  log  lengths  in  a  given  gene. 

Chapter  V  answers  the  question  of  whether  intronic  sequence  data  in  different 
regulatory  groups  are  totally  random  or  show  evidence  of  a  first  order  Markov  process. 


Numerous  articles  state  that  a  Markov  process  is  present  in  DNA  sequences  (see  Chapter 
II).  Mostly  focused  on  the  coding  regions,  a  few  references  suggest  that  Markov  structure 
may  also  hold  for  introns.  There  remains  a  prevalent  belief  in  the  biological  community 
that  introns  do  not  exhibit  evidence  of  a  Markov  process.  By  use  of  a  hypothesis  testing 
approach,  this  thesis  show  that  the  67  percent  of  introns  in  the  data  set  do  exhibit 
evidence  of  a  Markov  process. 

Assuming  Markov  processes,  a  statistical  hypothesis  test  is  applied  to  determine 
whether  comparable  introns  from  each  gene  have  homologous  Markov  transition  matrices 
in  Chapter  VI.  The  resulting  Chi-square  test  statistics  are  averaged  for  each  gene  and 
placed  into  a  distance  matrix  that  is  used  in  the  clustering  process.  This  provides 
evidence  that  results  of  a  statistical  test  can  be  used  in  the  creation  of  a  distance  matrix. 
This  is  also  the  first  systematic  look  at  intronic  regions  both  within  a  gene  and  among 
genes. 

Results  of  all  cluster  analyses  in  this  thesis  are  examined  to  ascertain  whether  there  is 
any  consistent  evidence  of  clustering  by  gene  regulatory  function.  The  success  of  this 
methodology  suggests  that  a  similar  approach  can  be  taken  when  the  data  set  includes  a 
gene  of  unknown  regulatory  status. 

Scope  of  the  Study 

This  thesis  addresses  statistical  issues  related  only  to  the  human  nuclear  intronic 
regions  in  oncogenes,  tumor  suppressor  genes,  and  non-regulatory  genes  (involved  in  the 
process  of  enzyme  metabolism).  Because  the  evolutionary  conservation  of  intronic 
regions  between  species  is  known  to  be  relatively  weak,  it  is  not  prudent  to  generalize 
these  results  to  other  organisms.  Similarly,  this  study  draws  no  conclusions  concerning 
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highly  regulatory  genes  related  to  diseases  other  than  cancer,  nor  does  it  provide 
conclusions  concerning  non-regulatory  genes  involved  in  human  maintenance  processes 
other  than  enzyme  metabolism.  Lastly,  as  all  introns  analyzed  in  this  study  are  nuclear 
introns,  generalizations  to  other  intronic  groups  (i.e.,  groups  I,  II  and  III  introns  and 
transfer  ribonucleic  acid  (tRNA)  introns)  should  not  be  made.  These  groups  are  known  to 
be  different  (Hickson,  1989). 

The  methodology  in  this  thesis  seeks  a  global  measure  of  closeness  between  two 
genes  that  does  not  utilize  traditional  gene  alignment  and  scoring  based  on  percentage 
match.  There  are  times  when  local  and  global  alignments  with  matching  are  necessary 
and  preferable;  e.g.,  multiple  alignment  of  protein  products  (where  the  biological  function 
of  the  protein  products  is  relatively  well  explored),  searches  for  transcription  factors, 
searches  for  oligonucleotides,  and  single  alignment  searches  of  highly  conserved  coding 
regions.  This  thesis  is  not  intended  denigrate  work  accomplished  in  these  areas,  but 
offers  an  alternative  approach  to  current  methods  when  drawing  conclusions  about  the 
relationship  between  two  genes  based  on  the  entire  intronic  data  available.  The 
homology  test  used  in  this  thesis  is  based  on  Markov  processes  rather  than  on  aligned 
matches  that  provide  a  composite  score. 

Although  this  research  is  preliminary,  it  is  hoped  that  the  conclusions  will  help  to 
determine  possible  avenues  for  future  biological  laboratory  research.  Results  observed 
may  change  as  more  intronic  gene  sequence  data  become  available  and  the  methodology 
can  be  enhanced  to  incorporate  more  data. 
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CHAPTER  II 


REVIEW  OF  THE  LITERATURE 

Biological  Review 
Gene  Structure 

Genes  are  made  of  deoxyribonucleic  acid  (DNA),  which  codes  for  the  production  of 
polypeptide  chains  that  are  joined  in  specific  orders  to  produce  proteins  (Friedman,  et  al, 
1992).  DNA  molecules  exist  primarily  in  the  cell  nucleus  and  consist  of  two 
complementary  chains  that  twist  to  form  a  double  helix.  Each  chain  is  composed  of  four 
nucleotides  that  contain  a  deoxyribose  sugar,  a  phosphate,  and  a  pyrimidine  or  purine 
base.  Purine  bases  are  adenine  (A)  and  guanine  (G);  pyrimidine  bases  are  cytosine  (C) 
and  thymine  (T).  Thymine  occurs  only  in  DNA  and  is  replaced  by  Uracil  (U)  in 
ribonucleic  acid  (RNA).  Ribonucleic  acid  is  similar  to  DNA  except  that  it  is  composed 
of  a  ribose  sugar  and  is  found  in  the  cytoplasm  of  the  cell  as  well  as  the  nucleus. 

Role  of  Intronic  Regions 

In  1977,  it  was  discovered  that  eukaryotic  genes  are  interrupted  (Lewin,1994).  The 
interruption  of  eukaryotic  genes  means  there  are  huge  blocks  of  DNA  that  do  not  code  for 
proteins.  Although  the  existence  of  these  interrupted  regions  is  now  known,  the  main 
focus  of  most  genetics  research  is  the  DNA  coding  regions,  equivalently  called  exonic 
regions,  exons,  or  the  translated  portion  of  DNA  (Lewin,  1994).  Yet,  less  than  3  percent 
of  DNA  carries  code  that  produces  proteins.  The  purpose  of  the  remaining  97  percent  of 
the  code  is  currently  unknown.  Explanations  for  these  regions  in  the  literature  are  sparse 
and  often  conflicting  (Wills,  1991). 


Three  types  of  non-coding  regions  are  found  in  eukaryotic  DNA.  These  are  intronic 
regions,  untranslated  regions  and  intergenic  spacers.  For  the  purpose  of  this  thesis, 
intronic  regions  of  nuclear  genes  are  the  only  non-coding  regions  I  will  discuss  and 
analyze. 

Intronic  regions,  or  introns,  are  non-coding  segments  of  a  gene  that  are  transcribed 
but  then  removed  from  the  transcript  by  splicing  together  the  coding  regions  (exons)  on 
either  side  of  them.  Once  removed,  they  are  quickly  degraded  in  the  nucleus  (King  and 
Stansfield,  1990).  Transcription  is  the  process  by  which  the  genetic  information  on  the 
sense  strand  of  DNA  is  transcribed  into  a  complementary  messenger  RNA  (Wills,  1991). 
With  very  rare  exceptions,  the  sequence  GU  represents  the  beginning  of  an  intron  and  the 
sequence  AG  designates  the  end  of  an  intron  (Goodman,  1994  and  Lewin,  1994).  Further 
studies  have  been  done  to  determine  if  there  is  more  of  a  pattern  than  simply  a  particular 
dinucleotide  at  each  end.  The  results  have  not  been  consistent  (Penotti,  1991  and 
Gelfand,  1989). 

Several  theories  provide  possible  explanations  for  the  purpose  of  intronic  DNA. 

These  range  from  the  theory  that  introns  are  meaningless  evolutionary  relics  i.e.,  “junk 
DNA”  (Flam,  1994)  to  the  theory  that  intronic  regions  somehow  control  gene  regulation 
(Wills,  1994).  As  specific  examples  of  intronic  regions  controlling  regulation,  it  was 
proposed  that  intronic  structures  of  cancer-related  genes  such  as  K-ras  and  H-ras  regulate 
susceptibility  to  cancer  (You  et  al,  1992,  Hashimoto-Gotoh  et  al,  1988,  Malkinson  and 
You,  1994,  Telliez  et  al.,  1995).  Laboratory  experiments  that  intentionally  splice  intronic 
regions  out  of  the  gene  before  transcription  have  demonstrated  an  increase  or  decrease  in 
protein  production  (Bordonaro,  1994  and  Bossu,  1994).  Although  these  examples 
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support  the  premise  that  intronic  regions  do  have  some  sort  of  regulatory  purpose,  the 
possible  mechanism  behind  this  process  is  currently  unknown. 

The  lack  of  consensus  over  intronic  purpose  has  unfortunately  resulted  in  a  large  hole 
in  the  literature  when  it  comes  to  quantifying  and  classifying  known  facts  about  these 
non-coding  regions.  Another  major  reason  for  this  lack  of  information  relates  to  the 
purpose  behind  the  Human  Genome  Organization  (HUGO).  Until  very  recently,  HUGO 
did  not  solicit  the  collection  and  analysis  of  intronic  information.  The  primary  goals  of 
HUGO  give  the  highest  priority  to  mapping  and  determination  of  the  complete  human 
DNA  sequence  (Pearson  and  Soil,  1991).  However,  a  gene’s  total  coding  sequence 
defines  a  complete  sequence.  Published  DNA  code  need  not  include  the  intronic  regions 
to  be  designated  as  complete  in  the  molecular  genetics  literature. 

At  the  present  time,  the  sequence  author  decides  whether  to  enter  intronic  information 
into  GenBank  (national  repository  for  DNA  sequence  data  sponsored  by  the  National 
Center  for  Biotechnology  Information  (NCBI)).  Many  authors  do  not  enter  non-coding 
information  because,  unless  they  have  a  computer  with  a  digitizer,  they  have  to  enter  the 
code  into  GenBank  manually  via  INTERNET.  It  is  tedious  to  enter  these  regions  that  can 
easily  be  100,000  base  pairs  (bp)  long  (Lewin,  1994).  This,  plus  the  belief  that  introns 
may  serve  no  useful  purpose,  leads  many  researchers  to  disregard  entering  intronic 
sequence  code  into  GenBank.  The  sporadic  nature  of  intronic  sequence  data  entry  into 
GenBank  and  the  difficulty  of  manipulating  GenBank  for  exploratory  data  searches  result 
in  little  research  concerning  introns. 

Only  a  few  literary  sources  quantify  intronic  regions.  One  fact  consistently  reported 
in  the  literature  is  that  the  number  of  introns  in  a  gene  can  vary  greatly  across  genes. 
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Similarly,  the  lengths  of  these  non-coding  regions  vary  greatly  both  between  genes  and 
within  a  given  gene  (Lewin,  1994).  However,  a  statistical  definition  of  “varies  greatly” 
and  simple  information  like  mean  intronic  length  and  variance  of  intronic  length  for  a 
given  human  gene  is  frequently  unknown.  The  most  basic  descriptive  statistics  are 
unavailable.  Questions  of  interest  like  whether  genes  with  very  small  numbers  of  introns 
(or  very  large  numbers  of  introns)  tend  to  be  alike  in  function  remain  unanswered.  A 
paper  by  Hawkins  (1988)  is  frequently  referenced  by  the  biological  community  as  a 
sources  for  determination  of  intron  length  in  various  parts  of  genes  for  vertebrates, 
insects,  plants  and  fungi.  The  differences  between  phyla  are  depicted  in  Table  2.1 . 


Table  2.1:  Average  Length  and  Number  of  Introns  Collected  by  Hawkins 


5’  Introns 

Internal  Introns 

3’  Introns 

Vertebrates 

1811(91) 

1127(1941) 

681(25) 

Insects 

5507(19) 

622(210) 

Fungi 

86(126) 

Plants 

249(200) 

Average  lengths  are  provided  in  bp.  Numbers  of  introns  collected  are  provided  in 
parentheses.  5’  introns  (i.e.,  introns  at  the  beginning  of  a  gene)  are  introns  wholly 
within  the  5’  non-coding  region.  3’  introns  (i.e.,  introns  at  the  end  of  a  gene)  are 
introns  within  the  3’  untranslated  portion  of  the  gene.  Not  all  intronic  information 
was  present  for  every  gene  (Hawkins,  1988). 


Although  Hawkins  attempted  to  do  a  very  thorough  job,  the  paper  suffers  from  the 
limitations  of  information  known  at  that  time.  First,  the  sample  size  is  extremely  small. 
Second,  in  1988  the  biological  estimates  of  true  lengths  for  long  introns  were  not 
available.  This  created  a  degree  of  selection  bias  in  his  results.  Third,  data  for  humans 
are  contained  within  the  vertebrates  class.  If  there  are  differences  moving  up  the  ladder 
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to  the  higher  organisms,  there  may  be  differences  within  the  vertebrate  phylum  as  well. 
This  means  the  group  of  greatest  biological  importance  to  us  is  somewhat  obscured  by 
the  remainder  of  the  vertebrate  data.  Lastly  and  most  importantly,  Hawkins  does  not 
differentiate  between  intronic  lengths  in  regulatory  genes  versus  non-regulatory  genes; 
hence,  potential  differences  between  these  two  groups  are  confounded  by  the  lack  of 
stratification. 

In  1988,  Hawkin’s  study  was  a  time  consuming  task  based  on  current  information.  In 
1996,  this  paper  should  be  re-evaluated  and  replaced  with  more  modem  information. 
Although  still  far  from  perfect,  information  in  this  thesis  will  reflect  a  more  accurate 
picture  than  was  possible  ten  years  ago. 

Rationale  for  the  Selection  of  Regulatory  and  Non-regulatory  Genes 

Regulatory  genes  are  genes  whose  primary  function  is  to  control  the  rate  of  synthesis 
of  the  products  of  other  distant  genes  (King  and  Stansfield,  1990).  Oncogenes  are  normal 
genes  involved  in  the  regulation  of  cell  growth,  but  can  lead  to  neoplasia  (abnormal  tissue 
growth)  when  mutated,  overexpressed  or  amplified  (i.e.,  a  portion  of  the  DNA  sequence 
is  replicated)  When  this  occurs  that  they  induce  uncontrolled  cell  proliferation  (Friedman 
et  al,  1990).  Tumor  suppressor  genes  are  normal  genes  whose  function  is  to  prevent  the 
development  of  neoplasia,  but  when  mutated  the  ability  to  prevent  this  development  is 
lost  (Friedman,  1990).  Non-regulatory  genes  involved  in  enzyme  metabolism  are  genes 
whose  primary  function  involves  the  physical  and  chemical  processes  by  which  living 
cells  produce  and  maintain  themselves,  and  by  which  energy  is  made  available  for  use  by 
the  organism  (King  and  Stansfield,  1990). 
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Degree  of  mRNA  stability  appears  to  play  a  role  in  regulating  the  level  of  gene 
expression  during  cell  growth  and  differentiation  (Schiavi,  1992).  In  mammalian  cells, 
mRNA  half-lives  range  from  as  short  as  minutes  to  as  long  as  days.  Among  the  shortest- 
lived  mRNAs  are  the  FOS  and  MYC  proto-oncogene  mRNAs.  These  have  protein  half- 
lives  of  approximately  20-30  minutes  and  mRNA  half-lives  of  approximately  30  minutes 
(Hargrove,  1989).  The  short  half-lives  of  these  mRNAs  appear  to  play  a  critical  role  in 
the  normal  function  of  these  genes.  The  quick  deregulation  of  FOS  or  MYC  mRNA  may 
somehow  contribute  to  oncogenesis  (Schiavi,  1992).  As  a  comparison,  one  of  the  non- 
regulatory  genes  chosen  for  this  thesis,  glyceraldehyde-3 -phosphate  dehydrogenase,  has  a 
protein  half-life  of  75-130  hours  and  a  mRNA  half-life  of  8  hours  (Hargrove,  1989),  a 
half-life  much  longer  than  the  oncogenes’  half-lives. 

The  issue  of  mRNA  stability  and  gene  regulation  raises  several  questions  of  interest. 
How  different  are  the  half-lives  of  oncogenes  and  tumor  suppressor  genes  from  non- 
regulatory  genes?  Does  this  difference  have  anything  to  do  with  the  knovm  physical 
features  that  depict  that  gene  (e.g.,  length  of  the  gene,  number  of  intronic  regions,  etc.)  or 
with  the  intronic  DNA  sequence  code  within  that  particular  gene?  Is  there  a  combined 
effect  from  number  of  introns,  the  means  and  standard  deviations  of  the  intomic  log 
lengths,  and  the  DNA  sequence  code?  Do  genes  tend  to  cluster  based  on  degree  of 
regulatory  function?  Answers  to  these  questions  are  unexplored  in  the  literature.  The 
gene  groupings  in  this  thesis  seek  to  provide  answers  to  the  above  questions. 
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Statistical  Review 


Cluster  Analysis  and  DNA  Sequence  Analysis 

Cluster  analysis  is  the  formal  study  of  methods  and  algorithms  for  organizing  data 
objectively  into  groups,  or  collections  of  groups  (Hirtle,  1995).  Because  groupings  of 
data  are  often  of  interest  in  DNA  research,  cluster  analysis  has  been  used  widely.  Some 
of  the  oldest  uses  of  clustering  and  classification  are  connected  to  the  development  of 
lineage  among  plants  and  animals  (Haitigan,  1975).  Aristotle  was  the  first  to  use 
classification  on  500  plants  and  animals  based  on  a  comparison  system  he  developed. 
Linnaeus  developed  the  present  classification  system  of  plants  and  animals  in  1753.  The 
tree  structure  (dendrogram)  that  is  often  associated  with  standard  clustering  procedures 
was  used  by  taxonomists  after  Darwin  developed  his  evolutionary  theories.  Not 
surprisingly,  the  majority  of  the  DNA  statistical  literature  concerns  the  use  of  cluster 
analysis  as  related  to  evolutionary  issues  (King  et  al.,  1979,  Meyer,  1991,  Ladunga, 

1992).  Another  area  that  has  received  much  recent  attention  is  clustering  groups  of  cells 
based  on  various  cellular  properties  (Mukwaya  and  Welch,  1989,  Musser  et  al.,  1990, 
Garrels  et  al,  1990,  van  der  Mei  et  al.,  1991). 

Unfortunately,  there  has  not  been  much  work  using  classification  and  clustering  in 
grouping  DNA  sequence  data.  No  papers  use  cluster  analysis  on  intronic  sequences.  The 
few  papers  relating  to  sequence  analysis  generally  used  only  one  method  of  clustering  to 
investigate  a  particular  biological  issue  (Rowe,  et  al,  1984,  Authn  and  Fieldes,  1992, 
Bickam  et  al,  1995, ).  Not  much  progress  has  been  made  to  incorporate  what  is  known 
statistically  about  the  data  into  the  analysis.  Additionally,  most  researchers  do  not 
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question  the  merits  of  the  particular  clustering  approach  that  they  use;  they  use  whatever 

clustering  algorithm  to  which  they  have  access,  regardless  of  how  well  it  suits  their  data. 

Markov  Processes  and  DNA  Sequence  Analysis 

The  use  of  Markov  processes  in  DNA  research  has  also  been  widely  explored.  Some 

articles  focus  on  the  use  of  Markov  processes  in  nucleotide  sequence  analysis,  but  the 

majority  of  research  is  applied  to  the  creation  of  models  to  simulate  evolutionary  events. 

Since  evolutionary  issues  are  beyond  the  scope  of  this  paper,  this  review  focuses  on  only 

those  methods  concerning  nucleotide  sequence  analysis. 

To  perform  a  statistical  analysis  of  DNA,  one  desires  an  adequate  mathematical 

model.  After  the  first  sequencing  of  DNA,  it  was  observed  that  a  model  of  independent 

random  sequences  was  probably  inappropriate.  Numerous  models  were  proposed 

(Gelfand  et  al.,  1992),  including:  stationary  and  non-stationary  processes,  Markov 

models,  hidden  Markov  chains,  mixture  transition  distribution  models,  block  models,  and 

finite  automata.  Despite  a  substantial  amount  of  research,  a  completely  adequate  model 

for  the  description  of  DNA/protein  sequences  is  still  an  open  problem.  No  existing 

algorithm  has  been  able  to  generate  a  long  sequence  (>  10,000  bp)  that  can  pass  as  a 

naturally  occurring  DNA  sequence.  Markov  chains  are  probably  the  most  consistent  and 

widely  chosen  of  the  possible  models,  although  their  use  has  not  been  without 

controversy.  Neither  increasing  the  Markov  chain  order  nor  taking  into  account  biased 

frequencies  of  nucleotide  runs,  explains  all  the  anomalies  found  in  genetic  code. 

Older  examples  of  DNA  sequence  analysis  using  Markov  chains  date  back  to  the 

1970’s  (Fuchs,  1980).  In  early  analyses,  researchers  believed  they  would  find  a  Markov 

order  that  best  fit  all  DNA  sequences.  Gatlin  (1972)  examined  first-order  Markov  chain 
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models.  Hasegawa  and  Yano  (1975),  and  Figueroa  et  al.  (1978)  examined  second-order 
Markov  chain  models.  Garden  (1980)  provided  evidence  against  the  previous  belief  of  a 
possible  “best”  order  when  he  concluded  that  different  orders  of  Markov  chains  optimally 
fit  different  DNA  sequences.  In  his  analysis,  a  third-order  model  best  fit  a  (t)X174 
bacteriophage  sequence,  a  second-order  best  fit  both  the  'early  and  late  regions  of  the 
SV40  bacteriophage,  and  the  nucleotides  in  the  MS2  gene  (partial  sequence)  were 
independent. 

Considering  these  results.  Garden  developed  a  modeling  methodology  to  determine 
the  optimal  order  of  a  given  Markov  chain  by  applying  Tong’s  (1975)  procedure.  He 
applied  this  methodology  to  a  Markov  chain  with  a  finite  number  of  states.  The  basis  for 
Tong’s  procedure  is  the  Akaike  information  criterion  (AIC).  If  L(k)  is  the  likelihood 
corresponding  to  the  order  model,  the  number  of  independent  parameters  for  the 
model  is  r(k)  =  4*^^  -  4*.  The  criterion  for  selection  of  model  order  is  based  on  the 
statistic  R(k)  as  follows  (Fuchs,  1980): 

{k\  R(k)  =  -2lnL(k)  +  2r(k)  \s  minimized} 

Another  method  for  determining  the  Markov  chain  order  (based  on  Hoel,  1954)  used 
a  likelihood  ratio  test.  In  this  case,  is  the  -  2  In  likelihood  ratio  such  that 

*  Vm  =  -Wn  L(k)  -lnL(k  +  \)]. 

If  icTJic+j  exceeds  the  critical  value  of  a  chi-square  distribution  with  degrees  of  freedom 
given  by 

df(k  +  l\k)  =  T(k  +  l)-T(k), 
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the  difference  in  fit  between  the  k‘^  and  the  (k+l/‘  order  models  is  statistically  significant. 
Fuchs  (1980)  compared  the  Tong  procedure  and  the  likelihood  ratio  test  procedure  on  7 
complete  DNA  sequences  (all  bacteriophages,  viruses  and  plasmids).  He  hoped  to  obtain 
concordance  between  the  two  methods  but  instead  found  discrepancies  (i.e.,  each  method 
picked  a  different  optimal  order  for  each  of  the  seven  genes).  He  concluded  that  the  order 
of  the  model  selected  by  the  R(k)  criterion  was  directly  proportional  to  the  number  of 
nucleotides  in  the  sequence  (i.e.,  model  order  increased  as  the  length  of  the  DNA 
sequence  increased).  A  possible  explanation  is  that  longer  DNA  sequences  are  indeed 
different.  Alternatively,  and  more  likely,  the  longer  the  sequence,  the  higher  the  order  of 
the  model.  In  short  sequences,  the  frequencies  of  the  nucleotide  groups  are  small.  When 
this  occurs,  the  power  of  the  goodness  of  fit  test  diminishes  and  there  is  a  tendency  to 
accept  models  of  lower  order. 

This  raises  the  issue  of  whether  modeling  approaches  common  in  the  field  of  time 
series  analysis  are  appropriate  for  the  analysis  of  a  long  DNA  sequence.  Fuchs 
recommended  the  addition  of  an  analysis  of  residuals  when  investigating  the  order  of  a 
sequence.  This  allows  investigation  of  differences  between  observed  and  expected 
frequencies  in  groups  of  nucleotides,  plus  examination  of  model  order. 

Recent  DNA  sequence  analyses  using  Markov  chains  follow  Fuch’s 

recommendations.  Many  analyses  determine  a  k‘^  order  Markov  transition  matrix  that 

provides  frequencies  for  words  of  length  k.  The  observed  frequencies  are  compared 

against  the  expected  frequencies  to  provide  statistical  evidence  of  the  chosen  model. 

Examples  of  this  methodology  (i.e.,  the  same  basic  methods  applied  to  different  DNA 

sequences)  are  Phillips  et  al,  1987,  Barrai  et  al,  1990,  Lewis  et  al,  1995.  These  authors 
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searched  for  a  plausible  biological  conclusion  based  on  an  observed  pattern  in  a  DNA 
region  after  laboratory  sequencing. 

Only  one  paper  incorporates  the  areas  of  Markov  chains  and  DNA  homology  (Lake, 
1994).  The  paper  was  written  for  the  purpose  of  evolutionary  comparisons,  but  a  portion 
of  the  methodology  could  be  applied  to  DNA  nucleotide  homology.  The  paper  involves 
paralinear  distances.  Paralinear  distance  measures  the  distance  between  two  sequences. 
These  measures  are  easy  to  calculate,  valid  for  nucleotide  sequences,  and  gaps  (positions 
where  one  or  more  nucleotides  are  missing  in  a  strand  of  DNA)  may  be  included.  The 
paralinear  distance  between  two  sequences  i  and  j  is 

det  Jjj 

~  (detD.,y'^(detDjy'^  ' 


To  calculate  the  paralinear  distance  between  sequences  i  and  j,  one  first  computes  Jy,  the 
joint  probability  matrix.  For  example,  sequence  1  and  sequence  2  are  both  10  bp  long. 
Each  sequence  consists  of  only  nucleotides  C  and  T.  The  two  sequences  axe  as  follows: 

Sequence  1:  CTTCCCTCTC 
Sequence  2:  TTTCTTCCTC 


Jjj  is  represented  by 


Sequence  1 


Sequence  2 


C 

T 

c 

3 

3 

T 

1 

3 

6 

4 


Once  Jy  is  computed,  diagonal  matrices  D;  and  Dj  can  be  constructed  from  the  Jy  matrix 


where  D,  = 


6  0 
0  4 


(i.e.,  diag  (row  totals)  and  D2  = 


4  0 
0  6 


(i.e.,  diag  (columns  totals). 
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The  quantity  =  2  In  2  =  1 .386  determines  the  degree  of  similarity  between  the  two 
sequences. 

This  method  has  some  inconsistencies.  Even  though  it  claims  to  be  based  on  a 
Markov  chain,  the  frequency  matrix  is  based  simply  on  common  alignments  between  the 
two  sequences,  not  a  transition  matrix.  Also,  there  is  no  imderlying  methodology  to  use 
this  paralinear  distance  in  a  hypothesis  testing  approach,  so  there  is  no  way  to  judge  the 
value  oidij.  Thus,  this  method  does  not  help  determine  whether  two  sequences  are 
statistically  similar  or  different. 

Recently,  applied  researchers  have  moved  away  from  classical  Markov  chain 
methodology  and  toward  linguistic  approaches  (Sakakibara  et  al,  1994,  Mantagna  et  al, 
1994,  Dong  and  Searls,  1994)  and  algorithm  based  scoring  methods  (Arratia  et  al,  1988, 
Mott  et  al,  1989).  The  mathematical  and  statistical  theorists  who  have  continued  with 
the  Markov  chain  methodology  often  suffer  because  their  methodology  is  not  easily 
applied  to  large  amounts  of  DNA  sequence  data  (Gentleman  and  Mullin,  1989,  Kleffe 
and  Langbecker,  1990,  Kleffe  and  Borodovsky,  1992,  Kelly,  1994). 

While  the  Markov  chain  methodology  is  still  an  underlying  basis  for  statistical  DMA 
research,  the  focus  has  shifted  away  from  achieving  the  best  theoretical  methods  and 
instead  has  moved  toward  methods  that  can  better  handle  large  amounts  of  DNA 
sequence  data.  In  some  instances,  this  means  a  weakening  of  the  underlying  assumptions 
behind  the  model,  but  this  trade-off  is  necessary  in  the  development  of  computer 
algorithms  for  extraordinarily  long  sequences. 
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Computer  Review 

Computer  Algorithms  and  Intronic  Regions 

Since  the  mid-1970’s  when  rapid  sequencing  of  DNA  became  available,  researchers 
developed  computer  programs  to  handle  various  kinds  of  DNA  analyses.  With  the 
creation  of  HUGO  in  1988,  the  number  of  these  programs  has  proliferated.  Some  of 
these  programs  can  process  intronic  sequences  but  there  is  no  evidence  in  the  literature  of 
a  computer  algorithm  designed  specifically  to  aid  in  our  understanding  of  intronic 
regions.  Rather,  there  is  a  whole  class  of  algorithms  (used  primarily  for  DNA  mapping) 
that  seeks  to  identify  introns  so  that  they  can  be  removed  from  the  data,  keeping  only  the 
coding  regions.  Additionally,  there  is  another  whole  family  of  models  (called  multiple 
alignment  models)  that  cannot  handle  the  analysis  of  introns.  Only  single  alignment 
algorithms  (algorithms  that  determine  homology)  are  presently  useful  in  comparing 
intronic  regions.  But  even  these  have  some  constraints.  These  three  groups  of  models 
and  how  they  affect  our  knowledge  of  introns  are  discussed  below. 

Models  Incorporating  Features  of  Intronic  Regions  to  Aid  in  DNA  Mapping 

There  has  been  a  great  amount  of  DNA  mapping  in  the  past  10  years;  e.g.,  Andrzej 
Konopka  (Biolingua  Labs),  Gary  Stormo  (University  of  Colorado  at  Boulder)  and  Eric 
Snyder  (Sequana  Therapeutics,  Inc.)  have  worked  in  the  areas  of  mapping  and  non¬ 
coding  sequence  analysis  (Konopka  and  Smythers,  1987,  Konopka  et  al,  1987,  Stormo 
and  Snyder,  1993).  Most  of  this  research  is  aimed  at  identifying  features  of  intronic 
regions,  not  as  possible  identifiers  of  susceptibility  to  cancer,  but  for  subsequent  removal 
of  these  intronic  regions  in  long  sequences  of  unknown  DNA  data.  Now  that  rapid 
sequencing  techniques  are  available,  researchers  piece  together  long  sequences  of  DNA 
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from  a  designated  region  of  a  chromosome.  It  is  estimated  that  by  using  these  processes, 
the  whole  human  genome  will  be  sequenced  in  the  next  five  years  (Johnston,  1996). 

These  sequenced  chromosomal  regions  often  contain  numerous  unknown  genes. 
Researchers  try  to  determine  features  that  distinguish  introns  and  exons,  and  then  write 
computer  algorithms  to  flag  the  exons  and  introns  based  on  these  unique  features.  These 
algorithms  assess,  with  some  degree  of  probability,  the  location  of  the  numerous  genes 
along  a  chromosomal  DNA  sequence  (Snyder,  1995).  From  these  gene  sequences,  they 
determine  the  protein  for  which  the  gene  codes.  The  combination  of  computers  and 
laboratory  work  is  efficient  and  saves  considerable  time  and  money  so  that  more  genes 
can  be  found  and  mapped  in  a  short  time  period. 

These  algorithms  sometimes  include  observations  about  introns.  Unfortunately,  since 
it  is  not  the  main  focus  of  the  publication,  information  about  introns  may  be  fragmentary. 
The  specific  details  of  each  computer  program  used  to  detect  differences  between  exon 
and  intron  boundaries  relates  to  DNA  mapping  and  is  beyond  the  scope  of  this  thesis. 
Multiple  Alignment  Algorithms  and  Intronic  Regions 

This  group  of  computer  programs  takes  a  group  of  sequences  that  are  similar  in 
ftmction  and  creates  a  multiple  alignment  of  them.  This  alignment  is  stored  in  the 
database  as  a  matrix  called  a  “block”  and  future  unknown  sequences  can  then  be  matched 
against  these  blocks.  The  purpose  of  searching  a  database  of  blocks  is  to  detect  repeated 
domains  and  find  distinct  cross-family  relationships  that  had  been  missed  in  searches  of 
sequence  databases.  By  searching  the  data  in  block  form,  a  researcher  has  a  better  chance 
of  finding  a  short  sequence  because  the  blocks  focus  on  only  the  areas  of  high 

conservation  and  thus  eliminate  “noise”  in  the  sequences  (Henikoff  and  Henikoff,  1994). 
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Some  algorithms  used  for  this  method  are  PROTOMAT  (finds  the  best  path  for 
reconstructing  all  the  sequences  once  the  DNA  has  been  cut),  PATMAT  (takes  the  DNA 
portion  of  interest  and  searches  the  known  DNA  data  looking  for  areas  of  high 
conservation  (i.e.,  relatively  little  change  during  an  evolutionary  period  of  time)  and 
BLOCKSORT  (sorts  the  sequence  blocks  based  on  comparision  with  an  existing 
sequence  and  scores  them).  Some  block  searchers  can  be  accessed  from  an  electronic 
mail  server. 

There  are  two  basic  approaches  to  multiple  sequence  alignment  (Boguski,  1992). 
Global  alignment  attempts  to  match  a  group  of  sequences  along  their  entire  lengths.  It  is 
most  applicable  to  sequences  that  are  short  and  approximately  equal  in  length.  Local 
alignment  methods  are  better  suited  to  longer  sequences  that  vary  considerably  in  length 
and  share  only  small  isolated  regions  with  little  or  no  sequence  conservation,  or  for 
sequences  that  are  known  to  contain  internal  repeats. 

It  might  seem  as  though  these  types  of  programs  could  be  extended  easily  to  intronic 
sequences  or  even  to  complete  genes,  but  there  are  limitations  which  prohibit  this.  This 
methodology  depends  upon  the  fact  that  there  is  a  known  function  associated  with  a  given 
block  and  a  relatively  high  degree  of  conservation  between  the  family  members  of  a 
block.  Neither  of  these  features  is  generally  true  of  introns.  A  second  problem  is  that, 
although  one  of  these  algorithms  (PATMAT)  can  handle  a  sequence  of  5000  bp,  most  of 
them  are  designed  for  much  shorter  sequences.  Some  programs  cannot  exceed  even  200 
bp  in  length  before  they  start  “bogging  down”.  This  is  a  problem  for  introns,  as  they  tend 
to  be  very  long.  Third,  most  block  algorithms  have  very  limited  databases.  GenBank 
does  not  store  data  of  this  type.  Researchers  must  create  their  own  databases.  At  the 
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present  time,  these  databases  contain  only  proteins  and  there  is  no  block  intronic 
information  against  which  to  compare  a  new  sequence. 

A  similar  alternative  to  multiple  alignment  algorithms  is  a  modeling  methodology 
based  on  a  training  set  of  data.  There  is  overlap  between  these  models  and  models  that 
aid  in  mapping,  because  some  models  compare  multiple  genes  as  well  as  identify  new 
genes.  Models  like  neural  networks  (Snyder  and  Stormo,  1993)  and  hidden  Markov 
models  have  been  used  fairly  successfully  for  comparing  multiple  sequences  of  proteins 
or  coding  regions  (Krogh  et  al.,  1994,  Baldi  et  al.,  1994).  However,  these  models  require 
prior  knowledge  of  closely  homologous  genes  to  create  training  sets  for  the  inclusion  of 
future  genes;  such  prior  biological  knowledge  is  often  not  available  for  intronic  regions. 
In  addition,  these  models  are  prone  to  errors  even  when  there  are  good  data  (Krogh, 
1994).  Patterns  in  the  code  sometimes  cause  problems  for  these  algorithms  and  introns 
are  known  to  have  highly  repetitive  regions  that  create  unique  patterns.  Further,  these 
algorithms  tend  to  suppress  sequence  alignment  overlap  of  more  than  15  bp,  and  introns 
are  generally  very  variable  in  length.  All  of  these  issues  make  these  models  impractical 
for  identifying  intronic  homologies. 

Single  Alignment  and  Homology  Between  Intronic  Regions 

By  process  of  elimination,  single  alignment  techniques  are  the  present  method  of 
choice  when  it  comes  to  comparing  intronic  regions.  These  methods  are  probably  the 
most  commonly  used  for  any  type  of  DNA  analysis.  The  scenario  for  their  use  is  fairly 
consistent.  A  researcher  goes  into  the  laboratory  and,  with  the  aid  of  specific  primers, 
sequences  some  small  portion  of  the  human  genome,  yielding  a  clear,  readable  sequence, 
whose  purpose  is  unknown.  In  the  absence  of  an  expensive  DNA  computer  system,  an 


19 


automated  homology  algorithm  like  the  Basic  Local  Alignment  Search  Tool  (BLAST), 
available  free  via  the  INTERNET,  will  give  some  idea  whether  the  sequence  has  already 
been  identified. 

Easy  access  to  DNA  analysis  tools  has  helped  speed  the  pace  of  DNA  research,  but  it 
is  not  without  problems.  It  becomes  very  important  for  the  researcher  to  understand  the 
strengths  and  weaknesses  of  the  chosen  methodology.  By  design,  BLAST  does  not 
attempt  to  make  global  (end-to-end)  alignments  of  sequences  but  instead  focuses  on 
identifying  all  significant  local  similarities  across  a  pair  of  sequences  (Boguski,  1992). 
This  is  a  deterrent  if  one  has  a  long  intronic  sequence  and  wants  to  search  the  database  for 
another  similar  long  intronic  sequence.  BLAST  will  tend  to  truncate  both  sequences  in 
favor  of  a  short  region  of  high  homology  over  two  long  weak  alignments. 

Almost  all  the  single  alignment  algorithms  have  several  features  in  common  (whether 
they  are  commercial  or  governmental  in  origin).  Most  databases  rank  the  possible 
alignments  using  scoring  systems  which  vary  by  algorithm;  many  scoring  systems  created 
in  the  1980’s  are  now  inadequate  because  they  were  designed  for  much  simpler 
comparisons.  Sequences  used  to  be  smaller,  more  uniformly  conserved  and  single 
domain  protein  oriented.  A  scoring  system  designed  for  such  sequences  is  inadequate  for 
long  weakly-conserved  regions. 

Most  government  and  commercial  DNA  programs  offer  local  alignment  scoring  based 
on  a  system  that  sums  the  scores  for  aligned  pairs  and  then  scores  for  gaps.  Due  to  the 
lengths  of  the  sequences,  the  computer  processor  bogs  down  quickly.  It  becomes 
necessary  to  have  a  machine  with  a  parallel  architecture  to  compute  them.  To  reduce  the 
computational  time,  some  algorithms  (e.g.,  BLAST)  perform  the  local  alignment  and 
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simply  forbid  gaps.  Unfortunately,  sensitivity  to  weak  similarities  is  lost  because  the 
focus  becomes  on  smaller,  more  highly  conserved  regions.  Thus,  there  is  a  trade-off:  the 
more  complex  the  algorithm,  the  higher  the  price  in  terms  of  computation  time. 

Additionally,  even  those  systems  that  do  score  the  gaps  do  not  handle  introns  well. 
Different  types  of  gaps  are  important  to  introns.  Short  gaps  are  more  indicative  of  a  lack 
of  homology  than  longer  gaps.  Although  this  seems  rather  counterintuitive,  it  has  to  be 
remembered  that  DNA  folds  and  bends.  This  bending  and  folding  tends  to  occur  in 
intronic  regions.  As  such,  a  long  gap  may  not  indicate  a  lack  of  homology  because  the 
DNA  may  bend  around  and  realign  with  the  comparison  sequence  at  a  later  position. 
However,  when  the  current  algorithms  do  score  the  gaps,  they  weigh  every  mismatch  the 
same.  They  do  not  preferentially  treat  introns  on  the  basis  of  their  known  attributes. 

Another  problem  with  these  alignment  techniques  is  that  it  is  very  difficult  to  sort  out 
the  statistically  significant  scores  from  the  insignificant  ones.  This  problem  is  not  unique 
to  intronic  regions,  but  it  is  compounded  by  the  fact  that  intronic  regions,  by  nature,  tend 
to  be  more  weakly  conserved  in  nature.  This  means  there  will  be  more  weak  matches 
with  equivalent  scores,  assuming  that  one  has  even  captured  the  biologically  important 
sequences.  It  is  very  possible  with  these  algorithms  that  a  sequence  of  importance  is  lost 
among  all  the  irrelevant  or  chance  similarities.  The  longer  a  sequence,  the  greater  the 
score  expected  by  chance  (Altschul,  et  al.,  1994).  Thus,  even  though  this  is  the  best 
method  currently  available  (in  terms  of  cost,  availability,  and  ease  of  use),  it  has  some 
serious  deficiencies  when  used  to  investigate  homology  between  two  intronic  regions. 
This  hole  in  the  available  methodologies  to  characterize  and  align  gene  sequences  based 
on  their  intronic  regions  needs  to  be  filled. 


21 


CHAPTER  III 


DATA  COLLECTION 

Methodology 

Computer  Processing  and  Statistical  Applications  Software 

Data  collection  and  analyses  were  performed  on  a  MICRON  Pentium  90 
Powerstation.  DNA  sequence  data  were  collected  using  Network  Entrez,  a  TCP/IP-based 
client-server  version  of  World  Wide  Web  (WWW)  Entrez.  Network  Entrez  is  accessed 
via  the  WWW,  but  its  retrieval  capability  is  much  quicker  than  WWW  Entrez. 

Statistical  data  analyses  used  a  combination  of  S-Plus  for  Windows,  Version  3.2 
(Release  1)  and  SAS  for  Windows,  Version  6  (Release  6.10).  S-Plus  functions  were 
created  for  the  test  for  Markov  structure,  homology,  and  difference  of  means.  SAS 
programs  performed  the  cluster  analyses  using  PROC  CLUSTER  (SAS,  1988)  and 
MACRO  procedures  (SAS,  1989).  Appendices  D  and  E  contain  all  SAS  and  S-Plus 
programs  used  in  this  thesis. 

Criteria  for  Gene  Selection 

The  data  represent  an  availability  sample.  All  oncogenes,  tumor  suppressor  genes  and 
non-regulatory  genes  meeting  the  following  criteria  were  included:  (1)  the  gene  had  to 
have  partial  intronic  information  available  for  over  50%  of  its  introns,  (2)  a  given  intron 
had  to  contain  more  partial  intronic  information  than  merely  the  intron  ends  that  are  used 
to  match  the  primer  (these  can  be  easily  detected  because  they  tend  to  be  10  to  20  bp  long 
and  fall  on  either  side  of  each  exon),  (3)  when  only  a  partial  intron  was  sequenced,  either 
GenBank  or  a  published  article  had  to  provide  information  regarding  the  total  lengths  of 


the  introns  in  the  gene  for  over  50%  of  its  introns.  When  a  gene  had  alternative  splicing 
positions,  the  most  biologically  established  path  for  that  gene  was  chosen  (as  provided  by 
GenBank  or  a  published  paper). 

For  this  study,  it  was  deemed  more  important  to  have  valid,  more  complete 
information  for  fewer  genes  than  to  investigate  corrupted,  incomplete  information  for  a 
large  number  of  genes  (Weir,  1993). 

In  most  cases,  if  a  gene  included  any  intronic  data,  it  had  partial  sequence  data 
available  for  all  introns.  In  several  cases,  the  author  did  not  know  the  actual  length  of  one 
intron,  but  knew  the  lengths  of  over  50%  of  the  remaining  introns.  Of  greater  difficulty 
was  the  number  of  genes  with  no  intronic  information.  From  DNA  files  of  100 
oncogenes,  29  tumor  suppressor  genes,  and  40  non-regulatory  genes,  only  18  oncogenes, 
5  tumor  suppressor  genes  and  10  non-regulatory  genes  met  these  criteria.  More  data  are 
expected  to  be  available  in  the  next  several  years.  Hesketh  (1995)  provides  lists  of 
regulatory  genes. 

Table  3.1  displays  the  non-regulatory  genes  (labeled  (NR))  and  their  corresponding 
acronyms  in  the  data  set.  Table  3.2  displays  the  regulatory  genes.  This  table  includes  the 
most  prevalent  types  of  cancer  associated  with  each  gene.  Oncogenes  are  labeled  (0)  and 
tumor  suppressor  genes  are  labeled  (TS). 
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Table  3.1  Non-regulatory  Genes  and  Their  Abbreviations 


G6PD(NR) 

Glucose-6-phosphate  dehydrogenase 

PGK(NR) 

Phosphoglycerate  kinase 

ADOL(NR) 

Aldolase 

GTP(NR) 

Phosphoenolpyruvate  carboxykinase 

PFK(NR) 

Phosphofructokinase 

TPI(NR) 

Triose  phosphate  isomerase 

GCK(NR) 

Glucokinase 

Glyceraldehyde-3 -phosphate  dehydrogenase 

SDHB(NR) 

Succinate  dehydrogenase 

HKII(NR) 

Hexokinase 

(Smith  et  al.,  1983) 


Table  3.2;  Regulatory  Genes  and  Their  Most  Common  Carcinomas 


ABL(O) 

Chronic  myeloid  leukemia 

BCR(O) 

Chronic  myeloid  leukemia  and  Acute  lymphoblastic  leukemia 

FMS(O) 

Acute  myeloid  leukemia 

BCL-3(0) 

Chronic  lymphocytic  leukemia 

FOS(O) 

Osteosarcoma  and  Acute  lymphocytic  leukemia 

HST-l(O) 

Induce  blood  vessel  formation  and  are  synthesized  by  many  tumor  cells 

INT-2(0) 

Amplified  in  Breast  carcinoma  and  esophageal  carcinoma 

LCK(O) 

Renal  cell  carcinoma  and  chronic  lymphocytic  leukemia 

IVIYC(O) 

Small  cell  lung  carcinoma,  breast  and  cervical  carcinoma 

L-MYC(O) 

Breast  and  colon  carcinoma 

N-IVIYC(O) 

Neuroblastoma,  retinoblastoma  and  small  cell  lung  carcinoma 

IVIAX(O) 

NIH  3T3  fibroblasts,  HeLa  cells,  neuroblastoma-derived  cell  lines 

PIMl(O) 

Acute  myeloid  and  lymphoid  leukemias 

KIT(O) 

Small  cell  lung  carcinoma,  acute  myeloblastic  leukemia  blast  cells 

RAFAl 

Stomach  laryngeal,  lung  carcinomas  and  sarcoma 

FPS(O) 

Lung  carcinoma  and  hematopoietic  malignancies 

WNT-l(O) 

Retinoblastoma 

K-RAS2(0) 

Lung,  colon  and  pancreas  carcinomas 

MLHl(TS) 

Hereditary  nonpolyposis  colon  cancer 

MSH2(TS) 

Hereditary  nonpolyposis  colon  cancer 

MTSl(TS) 

Multiple  tumor  suppressor 

RB(TS) 

Retinoblastoma 

NFl(TS) 

Neurofibrosarcomas  and  malignant  Schwannomas 

(Hesketh,  1995) 


DNA  Sequence  Data  Collection 

Because  GenBank  is  a  retrieval  system,  not  a  relational  database,  its  use  for 
exploratory  analyses  is  labor  intensive.  Moreover,  most  files  in  GenBank  do  not  contain 
complete  intronic  information;  thus,  original  papers  verified  the  length  and  location  of  the 
introns  used  in  this  study.  Only  partial  intronic  information  is  often  available.  For 
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convenience,  an  author  may  have  split  the  DNA  code  across  an  intron.  Figure  3.1  depicts 
a  GenBank  sample  file. 


LOCUS  HSMYCC  8082  bp  DNA  PRl  27-JUL-1994 
DEFINITION  Human  c-myc  oncogene. 

ACCESSION  X00364  J00120  K01908  V00501 
NID  g34820 

KEYWORDS  myc  cellular  oncogene;  oncogene  cellular. 

SOURCE  human. 

ORGANISM  Homo  sapiens 

AUTHORS  Gazin, C.,  Dupont  de  Dinechin,S.,  Hampe,A.,  Masson,!. M.,  Martin,?., 
Stehelin,D.  and  Galibert,F. 

TITLE  Nucleotide  sequence  of  the  human  c-myc  locus:  provocative  open 
reading  frame  within  the  first  exon 
JOURNAL  EMBO  J.  3  (2),  383-387  (1984) 

MEDLINE  84182501 
REFERENCE  2  (bases  3507  to  7559) 

AUTHORS  Colby,W.W.,  Chen,E.Y.,  Smith, D.H.  and  Levinson,A.D. 

TITLE  Identification  and  nucleotide  sequence  of  a  human  locus  homologous 
to  the  v-myc  oncogene  of  avian  myelocytomatosis  virus  MC29 
JOURNAL  Nature  301  (5902),  722-725  (1983) 

MEDLINE  83141777 
COMMENT  NCBl  gi:  34820 
FEATURES  Location/Qualifiers 

source  L.8082 

/organism- 'Homo  sapiens" 

CDS  2304..2870 

/note="pot.  ORF  (aa  1-188);  NCBI  gi:  312410" 

/codon_start=l 

/transIation="MRGSGRLRTPELCCSRPPPPGPGRPWLPSCLEKGRASQRLGGKK 

* 

exon  <2304.. 2881 

/number=l 

intron  2882, .4505 

/number=I 

exon  4506. .5277 

/number=2 

CDS  join(452 1 .  .5277,6654.  .72 1 6) 

/note="NCBI  gi:  34821" 

/codon_start-l 
/product="48K  protein" 

/translation-"MPLNVSFTNRNYDLDYDSVQPYFYCDEEENFYQQQQQSELQ* 
intron  5278. .6653 

/number=2 

exon  6654^7657 

/number=3 

polyA_signal  7511. .7516 

/note="pot.  polyA  signal" 
polyA_signal  7652. .7657 

/note-"pot.  polyA  signal" 

BASE  COUNT  1850  a  2115  c  2135  g  1982  t 
ORIGIN 

I  agcttgtttg  gccgttttag  ggtttgttgg  aatttttttt  tcgtctatgt  acttgtgaat 
61  tatttcacgt  ttgccattac  cggttctcca  tagggtgatg  ttcattagca  gtggtgatag  * 

Figure  3.1:  Sample  GenBank  File  from  Entrez 

*  Amino  acid  translations  were  truncated  at  one  line. 

**  Nucleotide  sequence  data  was  truncated  at  120  bp. 
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Collection  of  Intronic  Length  Data 

Tables  of  intronic  lengths  were  created  in  Microsoft  Excel,  Version  5.0  and 
transported  to  S-Plus  and  SAS  as  text  files. 

Data  Limitations 

This  thesis  assumes  that  the  information  provided  by  GenBank  and  published  papers 
is  accurate.  To  the  extent  possible,  I  have  tried  to  use  the  most  current  GenBank  files  and 
verify  any  obvious  discrepancies.  I  found  two  obvious  mistakes  in  the  code,  nucleotides 
labeled  D  and  M.  Fortunately,  the  accompanying  papers  provided  the  correct  nucleotide 
information  and  these  errors  were  rectified.  In  several  other  cases,  the  published  paper 
differed  from  the  GenBank  sequence  length  by  one  or  two  nucleotides;  in  all  of  these 
instances,  sequence  numbers  were  provided  by  GenBank,  since  inaccurate  computations 
in  the  published  papers  accounted  for  the  discrepancy. 

Most  sequencing  is  accomplished  by  one  of  two  methods  —  the  Maxam  and  Gilbert 
DNA-sequencing  procedure  or  the  Sanger  DNA-sequencing  procedure  (Watson,  1992). 
Today’s  sequencing  process  often  includes  scanning  the  sequence  into  a  computer  file  as 
it  is  read  off  the  gel.  Although  there  is  much  progress  in  this  area,  sequencing  errors  still 
occur  for  a  variety  of  reasons  (Lipshultz,  1994).  Compression  artifacts  can  develop  when 
the  longer  strands  of  DNA  migrate  faster  than  the  shorter  strands.  Temperature  variations 
across  the  gel  (based  on  different  amoimts  of  thickness  caused  by  heat  dissipation)  can 
lead  to  uneven  mobility  within  a  band.  This  will  cause  the  band  to  broaden.  Broad  bands 
tend  to  overlap  and  are  more  difficult  to  read.  Fragments  longer  than  400-500  bp  are 
often  more  difficult  to  read. 
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Manual  trace  editing  can  find  some  errors,  but  this  takes  time  and  costs  money.  Some 
labs  have  now  incorporated  an  automated  trace  editing  (tedding)  tool  to  replace  the 
manual  process.  This  can  cheaply  eliminate  errors,  but  very  few  labs  have  this  capability 
(Lipshultz,  1994). 

Lipshultz  estimates  that  an  automated  sequencer  has  an  error  rate  of  4%  relative  to  the 
consensus  sequence.  Roberts  and  Posfai  developed  a  program  called  DETECT  (Roberts, 
1991).  Using  DETECT,  they  found  156  clear  errors  in  1.3  million  bases  due  to  frame- 
shift.  A  frame-shift  error  occurs  when  there  is  an  addition  or  substitution  of  a  nucleotide 
during  DNA  replication  such  that  the  normal  reading  frame  based  on  nuclotide  triplets  is 
shifted.  This  results  in  translation  of  the  wrong  sequence  of  amino  acids  after  the  point  in 
the  code  where  the  change  occurs  (King  and  Stansfield,  1990).  They  estimate  that  1%  of 
GenBank  has  frame-shift  errors  and  5%  has  substitution  errors. 

The  frame-shift  problem  applies  to  the  accurate  translation  of  the  nucleotide  code 
into  the  proper  amino  acids  based  on  grouping  of  three  nucleotides.  If  one  of  the  three 
nucleotides  is  deleted,  or  an  extraneous  nucleotide  is  added,  a  frame-shift  occurs.  Frame 
shift-errors  are  not  as  big  a  problem  for  intronic  nucleotide  sequence  data,  although  the 
code  still  has  an  incorrect  insertion  or  deletion.  But  the  4-5%  substitution  rate  error 
means  that  one  cannot  claim  that  DNA  sequence  data  entered  into  GenBank  is  flawless. 

There  is  also  a  degree  of  error  due  to  laboratory  estimation  of  intronic  lengths  for 
regions  that  have  not  been  sequenced.  The  lengths  provided  are  generally  rounded;  e.g.,  a 
sequence  of  length  1294  may  be  recorded  as  1300.  As  more  intronic  regions  are 
sequenced,  such  roimding  may  occur  less  frequently. 
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CHAPTER  IV 


INITIAL  CLUSTER  ANALYSES 

Methodology 

In  this  chapter,  I  investigate  whether  different  categories  of  genes  (i.e.,  regulatory  and 
non-regulatory  genes)  have  intronic  features  so  closely  related  that  they  will  cluster.  A 
cluster  is  a  set  of  objects  (e.g.,  genes)  that  resemble  each  other  more  than  they  resemble 
other  objects  not  in  the  cluster.  New  objects  joining  a  cluster  are  expected  to  have 
properties  similar  to  those  already  in  the  cluster. 

The  features  on  which  I  base  gene  clustering  in  this  thesis  are:  the  number  of  introns 
in  a  gene,  the  mean  of  the  log  lengths  for  introns  in  a  gene,  and  the  standard  deviations  of 
the  log  lengths  for  introns  in  a  gene. 

Classes  of  Clustering  Methods 

Clustering  methods  are  organized  into  two  classes;  partitioning  methods  and  hierarchical 
methods.  The  choice  of  clustering  method  depends  upon  the  underlying  purpose  behind 
the  research. 

In  partitioning,  the  data  are  classified  into  K  groups,  where  K  is  determined  in 
advance  (Kaufman,  1990).  A  problem  with  partitioning  is  that  the  algorithm  provides  the 
number  of  clusters  requested  by  the  researcher,  not  necessarily  the  optimal  number  of 
clusters.  Additionally,  partitioning  does  not  yield  insight  into  the  hierarchical 
relationships  between  objects  within  a  given  partition.  Some  common  types  of  partition 
methods  include  ^-means  clustering  and  fuzzy  algorithms. 


Hierarchical  clustering,  also  known  as  tree  structuring  (Hartigan,  1975),  is  a  series  of 
successive  fusions  of  the  attributes  into  groups  (Everitt,  1980).  Hierarchical  algorithms 
consider  all  possible  partitions  from  K  =  I  to  K  =  n  and  allow  visualization  of  the  intra¬ 
group  relationships  within  each  partition.  Since  the  biological  intra-group  relationships 
between  genes  are  important  to  this  thesis,  hierarchical  methods  are  used. 

There  are  two  kinds  of  hierarchical  techniques:  agglomerative  and  divisive  (Kaufman, 
1980).  These  methods  construct  the  hierarchy  in  opposite  directions.  Divisive  methods 
start  at  step  zero  with  all  the  objects  in  one  group.  In  each  subsequent  step,  a  cluster  is 
split  until  there  are  n  clusters  at  the  end  of  the  process.  Agglomerative  methods  start  at 
step  zero  with  n  clusters.  At  each  step  two  clusters  are  merged  until  there  is  one  cluster  at 
the  end  of  the  process.  The  agglomerative  technique  is  used  in  this  thesis. 

Clustering  Algorithm  Input  Structure 

Clustering  algorithms  use  one  of  two  input  structures,  a  two-mode  structure  and  a 
one-mode  structure  (Kaufman,  1990). 

Two-mode  Structure.  The  two-mode  structure  is  the  most  common.  The  first  step 
is  to  create  mnxp  coordinate  data  matrix,  where  n  is  the  number  of  objects  and  p  is  the 
number  of  features.  The  second  step  is  to  calculate  a  quantitative  distance  measure 
between  each  pair  of  objects  and  create  an  «  x  distance  matrix.  The  most  notable 
quantitative  distance  measure  uses  the  Euclidean  distance: 

k=\ 
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where  Xn^  is  the  value  of  the  A:'*  feature  for  the  i'^  object.  Euclidean  distances  will  be  used 
in  this  chapter.  Other  common  measures  are  squared  Euclidean  distance,  Manhattan 
distance  (or  city  block)  and  Minkowski  distance  (Everitt,  1980). 

One-mode  Structure.  This  structure  is  a  direct  collection  of  distances  based  on  all 
pairs  of  objects.  These  distances  combine  to  create  an  «  x  n  matrix.  One  mode  structures 
are  sometimes  created  in  place  of  the  common  distance  measures  discussed  above. 

A  common  misconception  about  cluster  analysis  is  that  it  always  must  begin  with  an 
object  X feature  matrix  (i.e.,  an  «  xp  matrix).  Many  times  the  one-mode  structure  is  more 
applicable.  In  this  thesis,  two-mode  structures  are  used  in  this  chapter  and  a  one-mode 
structure  is  created  in  chapter  6. 

Hierarchical  Clustering  Methods 

Hierarchical  clustering  methods  differ  primarily  in  how  distance  is  determined  from 
one  object  to  another  object  or  cluster.  However,  the  basic  steps  in  an  agglomerative 
hierarchical  clustering  are  common  to  all  methods  (Johnson  and  Wichem,  1988).  These 
are: 

(1)  Start  with  N  clusters  where  each  cluster  contains  a  single  object 

(2)  Create  quNxN  symmetric  matrix  of  distances,  D  =  {dnJ 

(3)  Find  the  nearest  pair  of  clusters  {U  and  V)  in  the  distance  matrix 

(4)  Designate  the  distance  between  clusters  U  and  V  as  duy 

(5)  Merge  clusters  U  and  V  to  form  (UV) 

(6)  Update  the  distance  matrix  by  deleting  the  rows  and  columns  associated  with  U 
and  V 
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(7)  Add  a  row  and  column  providing  the  distances  between  cluster  (UV)  and 
remaining  clusters 

(8)  Repeat  steps  3-7  N-1  times;  record  the  order  of  the  mergers. 

Choice  of  Clustering  Method 

One  of  the  hardest  decisions  in  a  cluster  analysis  is  the  choice  of  an  appropriate 
clustering  method.  No  clustering  method  is  optimal  in  all  situations  (Milligan,  1995), 
with  respect  to  its  ability  to  find  true  clusters  within  the  data.  The  researcher  is  faced 
with  numerous  methods,  all  with  inherent  strengths  and  weaknesses.  These  strengths  and 
weaknesses  have  been  identified  by  a  compilation  of  simulation  studies  that  have  been 
used  to  determine  situational  best  methods  based  on  different  data  scenarios  (Kuiper  and 
Fisher,  1975,  Milligan,  1995).  When  there  is  near  perfect  information  about  a  data  set,  it 
may  be  possible  to  choose  the  most  appropriate  method.  But  if  the  underlying  structure 
of  the  data  is  not  known  (as  often  occurs  in  exploratory  biological  analyses),  choice  of  a 
best  method  becomes  extremely  difficult.  As  a  solution,  it  is  a  recommended  that  one  use 
numerous  methods  and  look  for  a  consensus  among  results  (Everitt,  1980,  SAS,  1988, 
Johnson  and  Wichem,  1988,  Milligan,  1995).  Table  4.1  provides  a  description  of  the 
nine  clustering  methods  offered  by  SAS  that  will  be  employed  in  this  thesis.  If  the 
outcomes  from  the  various  methods  are  roughly  consistent  (i.e.,  there  is  a  consensus 
among  methods  concerning  cluster  members),  a  case  for  natural  groupings  can  be  made. 
This  approach  (i.e.,  drawing  conclusions  from  the  nine  clustering  methods)  is  followed  in 
this  thesis,  with  one  area  of  modification  based  on  results  from  Milligan’s  work. 

Milligan  (1980, 1985, 1995)  has  done  a  substantial  amount  of  research  in  the  area  of 
choosing  optimal  clustering  methodologies.  He  has  examined  a  number  of  methods 
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Table  4.1:  Description  of  Clustering  Methods  Used  in  This  Thesis 
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Table  4.1  (Continued):  Description  of  Clustering  Methods  Used  in  This  Thesis 
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(mostly  agglomerative  hierarchical  methods)  using  Monte  Carlo  simulations.  His 
validation  studies  are  designed  to  examine  the  effects  of  various  factors  on  cluster 
groupings  including:  inducing  error  on  the  data  points,  testing  different  similarity 
measures,  introducing  outliers  in  the  data  set,  assuming  different  population  distributions, 
evaluating  different  cluster  sizes,  and  varying  the  number  of  features  included  in  the 
clustering  process  (Milligan,  1995).  Based  on  his  validation  testing,  he  recommends 
Ward’s  minimum- variance  method,  and  the  Lance  and  Williams  beta-flexible  approach. 
Average  linkage  method  also  fared  well  in  validation  testing,  although  its  performance  is 
not  as  consistent  as  Ward’s  and  beta-flexible.  Single  linkage  consistently  performed 
poorly. 

In  this  thesis,  I  follow  the  published  advice  of  Everitt  (1980),  Johnson  and  Wichem 
(1988)  and  Milligan  (1995)  by  performing  cluster  analyses  for  all  methods  mentioned 
above.  I  use  the  results  from  all  methods  to  determine  the  proper  number  of  clusters,  but 
preferentially  emphasize  the  results  based  on  the  three  methods  recommended  by 
Milligan  (1995). 

Determining  the  Number  of  Clusters 

A  common  misconception  of  clustering  is  that  there  is  an  optimal  number  of  clusters. 
Most  clustering  methods  were  not  designed  to  determine  the  number  of  clusters  in  the 
data.  There  are  methods  for  investigating  this  issue,  but  there  is  no  guarantee  of  a 
consistency  between  methods  and  definitive  resolution  of  the  correct  number. 

It  is  tempting  to  determine  the  optimal  number  of  clusters  through  a  hypothesis 
testing  approach  to  assess  the  statistical  significance  of  a  cluster  structure  (Milligan  and 
Cooper,  1985),  e.g.,  via  analysis  of  variance  (ANOVA)  or  multiple  analysis  of  variance 
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(MANOVA).  Such  an  approach  is  inappropriate  because  the  results  will  almost  always 
conclude  significant  mean  differences,  even  when  the  data  are  random  noise  and  no  true 
clusters  are  present.  Cluster  analysis  inherently  disobeys  the  underlying  assumptions 
behind  ANOVA  and  MANOVA,  namely  that  the  data  are  independently  sampled  from  a 
multivariate  normal  distribution  and  that  the  objects  are  allocated  to  the  clusters 
randomly.  Hierarchical  clustering  meets  neither  of  these  assumptions.  Additionally, 
cluster  analysis  does  not  have  an  underlying  assumption  of  normality  (SAS,  1988,  and 
Milligan,  1995).  For  these  reasons,  standard  hypothesis  testing  methods  are 
inappropriate. 

There  have  been  numerous  attempts  in  the  last  thirty  years  to  address  the  optimal 
number  of  clusters  issue.  These  approaches  are  called  stopping  rules,  indicating  the  place 
where  one  is  to  stop  in  the  clustering  hierarchy.  A  comparative  study  of  thirty  attempts 
was  made  by  Milligan  and  Cooper  (1985).  Milligan  and  Cooper  simulated  data  that  had 
an  error  free  structure  and  in  which  distinct  clustering  was  present.  They  analyzed 
various  data  sets,  examining  the  ability  of  the  stopping  rules  to  detect  2,  3, 4,  or  5 
clusters.  Based  on  the  results  of  this  simulation  study,  Milligan  recommends  that  only 
the  top  five  stopping  rules  be  used  to  determine  the  proper  number  of  clusters.  Ideally,  at 
least  two  stopping  rules  should  be  used  simultaneously.  With  concurrence  between 
methods,  one  has  the  best  chance  of  determining  the  correct  number  of  clusters.  SAS 
incorporates  the  methods  ranked  first  and  second  from  Milligan  and  Cooper’s 
recommended  list,  namely  the  pseudo  F  and  the  pseudo 
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Pseudo  F.  The  formula  for  the  pseudo  F  at  a  given  level  in  the  hierarchy  is 


pseudo  F  =  f(T  -  Pg)/(G  -  l))/(Pa/(n-G)) 


where 


Fo-I. 


v=\ 


T. 


k^\ 


(where  is  the  if'  cluster) 


1=1  L*=i  J 

G  =  the  number  of  clusters  at  the  G'*  level  in  the  hierarchy  (SAS,  1988). 

An  equivalent  way  of  representing  the  pseudo  F  is  to  state  that  it  is  the  ratio  of  trace 
(sum  of  the  objects  on  the  main  diagonal)  of  the  between-cluster  sum  of  squares  matrix 
divided  by  (G-1)  and  the  trace  of  the  within-c luster  sum  of  squares  divided  by  (n-G).  F 
is  maximized  for  a  given  level  in  the  hierarchy  when  the  within-cluster  sum  of  squares 
decreases.  If  the  data  have  been  collected  independently  from  a  multivariate  normal 
distribution  and  randomly  allocated  to  groups,  the  pseudo  F  is  distributed  as  an  F 
distribution  with  p(G-l)  and  p(n-G)  degrees  of  freedom.  But  since  hierarchical  cluster 
analysis  does  not  require  a  normal  distribution,  nor  does  it  randomly  allocate  objects  to 
groups,  the  pseudo  F  is  not  distributed  as  an  F  distribution.  Nonetheless,  it  is  still  a  useful 
indicator  of  the  correct  number  of  clusters  (SAS,  1988).  The  pseudo  Fdoes  not  rely  on  a 
critical  value  to  determine  the  optimal  number  of  clusters  but  instead  relies  on  a  local 
maximization  of  the  pseudo  F  at  each  level  of  the  cluster  hierarchy.  The  chosen  pseudo 
F  for  the  optimal  number  of  clusters  will  be  higher  than  the  pseudo  Fs  for  the 
surrounding  clusters  (i.e.,  the  within  sum  of  squares  is  minimized). 
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Pseudo  The  pseudo  formula  for  joining  Cy  and  Cy  is 

pseudo  =  Buy  /  ((Wu  +  Wy)  /  (Ny  +  Ny  -  2)) 


where 

^UV  ~  ^UV  ~^U  ~^V  if  ^UV  =Cy'UCy 


k=\ 


Ny  ^  number  of  observations  in  Cy 


This  formula  creates  a  ratio.  The  denominator  is  the  sum  of  squares  for  error  when 
the  data  are  partitioned  into  two  clusters.  The  numerator  is  the  squared  error  of  the  new 
cluster  minus  the  squared  errors  of  the  two  separate  clusters.  The  pseudo  f  is  optimal 
when  the  ratio  is  minimized  (Milligan  and  Cooper,  1985).  The  pseudo  f  is  distributed  as 
an  F  random  variable  with  p  and  p(Ny  +  Ny- 2)  degrees  of  freedom  if  the  data  are 
independently  sampled  from  a  multivariate  normal  distribution  and  the  objects  are 
assigned  randomly  to  clusters  (SAS,  1988).  The  pseudo  f  is  computed  only  from 
information  in  the  latest  cluster  merger.  This  means  the  effective  sample  size  is  reduced 
compared  to  the  overall  sample  size,  so  the  statistic  may  suffer  from  a  lack  of  power  to 
detect  significant  differences.  However,  Milligan  and  Cooper’s  (1985)  simulation  studies 
determined  that  the  pseudo  f  consistently  performs  well  and  offers  an  effective  technique 
for  situations  where  the  sample  size  is  not  large. 

Selection  of  Variables 

One  of  the  beneficial  features  of  cluster  analysis  is  the  ability  to  combine  different 
types  of  variables  from  different  types  of  data  into  the  analysis.  Variables  can  be  counts, 
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ratios,  intervals,  ordinal  or  categorical  (Hartigan,  1975).  Cluster  analysis  can  also  handle 
missing  values.  In  cluster  analysis,  one  should  not  arbitrarily  bring  many  variables  into 
the  process  because  the  addition  of  even  one  or  two  irrelevant  variables  can  dramatically 
inhibit  cluster  recovery  (Milligan,  1 995). 

Decision  to  Standardize 

If  sizable  differences  in  means  and  variances  exist  between  variables,  standardization 
is  appropriate.  If  they  do  not  exist,  standardization  will  not  add  much  to  the  process 
(Milligan  and  Cooper,  1985).  In  this  thesis,  standardization  was  incorporated  based  on  a 
zero  mean  and  a  standard  deviation  of  one  (SAS,  1988). 

Criticisms  Levied  at  Cluster  Analysis 

Two  complaints  about  the  clustering  process  are:  (1)  cluster  analysis  is  not  based  on 
sound  probability  models,  and  (2)  results  are  poorly  evaluated  and  unstable. 

Hartigan  (1975)  counters  the  first  complaint  by  stating  that  most  statistical  analyses 
compare  predetermined  groups:  the  objects  are  assigned  to  the  groups  because  the 
researcher  assigned  them,  not  necessarily  because  the  objects  optimally  belong  in  those 
groups.  This  approach  works  fine  when  the  underlying  structure  of  the  data  is  known. 
But,  when  the  underlying  structure  is  unknown,  the  appropriate  use  of  these  methods 
becomes  questionable.  An  advantage  of  cluster  analysis  is  that  it  offers  a  methodology 
that  does  not  expect  predetermined  groupings,  providing  the  ability  to  observe  what  may 
be  closer  to  the  real  state  of  the  data  in  nature.  Since  an  underlying  structure  is  usually 
not  implied,  it  remains  credible  under  differing  data  structures. 

Although  individual  clustering  methods  may  appear  to  be  unstable  and  poorly 
evaluated,  the  stability  of  cluster  analysis  can  be  greatly  enhanced  by  looking  for 
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consensus  among  several  hierarchical  methods.  This  consensus  supports  the  assurance  of 
stability  that  a  single  hierarchical  method  cannot.  Additionally,  there  have  been 
numerous  simulation  studies  to  evaluate  hierarchical  models  xmder  diverse  scenarios; 
these  provide  additional  information  that  can  improve  the  stability  of  the  analysis. 
Application  of  Cluster  Analysis  to  Intronic  Features  Data 

Cluster  analyses  were  performed  for  all  combinations  of  the  three  biological  features. 
Of  the  seven  possible  combinations,  two  provided  biologically  significant  results  that  I 
will  further  discuss  in  this  thesis:  clustering  based  on  number  of  introns  in  a  gene,  and 
clustering  based  on  both  number  of  introns  and  mean  of  the  intronic  log  lengths  in  a  gene. 
The  standard  deviation  of  intronic  log  lengths  in  a  gene  did  not  result  in  biologically 
significant  clusters,  either  when  analyzed  separately  or  when  combined  with  other 
features. 

The  data  are  depicted  in  table  4.2.  The  number  of  introns  for  the  33  genes  in  the  data 
set  varies  dramatically.  The  range  is  1  to  56  introns  per  gene.  Plots  of  the  data  (truncated 
at  16,000  bp  for  better  visualization)  by  regulatory  group  are  provided  in  figures  A.1-A.4 
of  appendix  A.  Because  the  intronic  lengths  have  such  a  wide  range,  the  natural 
logarithm  was  used  as  a  transformation  function  where 

Zij  =  \n(X,j). 

Transformations  of  this  type  are  considered  acceptable  in  cluster  analysis  (Romesburg, 

1984).  Non-monotonic  transformations  are  not  acceptable.  They  can  change  the  number 

of  population  clusters  and  should  be  approached  with  caution  (SAS,  1988).  Simultaneous 

use  of  transformations  and  standardization  is  permitted  (Romesburg,  1984).  Plots  of  the 

transformed  data  by  regulatory  group  are  depicted  in  tables  A.  5- A.  8  of  appendix  A.  The 
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Table  4.2:  Total  Lengths  For  Introns  in  Each  Gene  (Per  Data  in  Published  Papers) 
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Table  4.2  (Continued):  Total  Lengths  For  Introns  in  Each  Gene  (Per  Data  in  Published  Papers) 
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variation  of  the  intronic  lengths  on  the  transformed  scale  is  greatly  reduced,  allowing  the 
transformed  attributes  to  contribute  more  equally  to  the  overall  similarity  between 
objects.  The  means  and  standard  deviations  of  the  intronic  log  lengths  for  all  33  genes 
are  shown  in  table  A.l  of  appendix  A. 

The  data  are  placed  in  an  nxp  features  matrix  where  n  is  the  number  of  objects  (i.e., 
the  33  genes)  and  p  is  the  number  of  features  (i.e.,  the  three  variables  of  biological 
interest).  All  variable  combinations  are  analyzed  by  the  same  clustering  process.  A  SAS 
MACRO  was  written  to  perform  the  nine  methods  outlined  in  table  4.1;  the  SAS  code 
provided  in  appendix  D.  Results  from  the  pseudo  F  and  pseudo  for  all  methods  are 
compared  to  determine  a  consensus  (when  possible)  as  to  the  optimal  number  of  clusters. 
Clustering  results  for  Milligan’s  three  preferred  methods  are  displayed  and  consensus 
concerning  cluster  members  is  evaluated  based  on  these  three  niethods. 

Clustering  Based  on  Number  of  Introns  in  a  Gene 

The  clustering  process  based  on  the  number  of  introns  in  a  gene  is  evaluated  for  the 
nine  clustering  methods  offered  by  SAS.  As  an  example,  table  4.3  depicts  an  edited  SAS 
output  for  the  average  linkage  method  based  on  the  number  of  introns  in  a  gene.  On  the 
left  side  of  the  table,  the  order  in  which  the  genes  enter  the  clusters  is  presented. 

Focusing  on  the  pseudo  F  column,  F  is  locally  maximized  (i.e.,  the  within-cluster  sum  of 
squares  is  minimized)  as  compared  to  the  surrounding  cluster  levels  when  the  number  of 
clusters  is  three  and  thirteen. 

The  pseudo  is  locally  minimized  when  the  within-cluster  sum  of  squares  is 
minimized.  One  should  look  for  dramatic  drops  when  evaluating  the  pseudo  In  this 
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case,  the  drops  when  the  number  of  clusters  is  3, 5, 6  and  10.  The  most  dramatic  drop 
occurs  when  the  number  of  clusters  is  3. 

There  are  a  large  number  of  ties  in  these  data,  but  this  is  not  an  uncommon  occurrence 
with  discrete  data  like  number  of  introns  in  a  gene.  A  tie  occurs  when  two  or  more  pairs 
have  the  same  minimum  distance.  PROC  CLUSTER  breaks  ties  based  on  the  numbered 
order  of  the  variables  in  the  data  set.  Each  tied  pair  has  a  smaller  ID  number  and  a  larger 
ID  number.  The  pair  with  the  minimum  larger  ID  number  is  chosen.  If  a  tie  still  results, 
the  pair  with  the  smallest  ID  number  is  fused  first  (SAS,  1988). 


Table  4.3:  SAS  Sample  Output  (Edited)  to  Demonstrate  Use  of  the 
Pseudo  F  and  Pseudo 


Average  Linkage 

Cluster  Analysis 

Number 

Frequency 

of 

of  New 

Pseudo 

Pseudo 

Clusters 

- Clusters  Joined - 

Cluster 

F 

t**2 

Tie 

32 

HST-1 (0) 

INT-2(0) 

2 

. 

T 

31 

CL32 

MYC(O) 

3 

T 

30 

CL31 

L-MYC(O) 

4 

. 

T 

29 

CL30 

N-MYC{0) 

5 

T 

28 

FOS  (0) 

WNT-1 (0) 

2 

T 

27 

FPS (0) 

MLHl (TS) 

2 

T 

26 

RAFAl (0) 

MSH2 (TS) 

2 

T 

25 

CL28 

MTSl (TS) 

3 

T 

24 

ABL{0) 

PGK(NR) 

2 

. 

T 

23 

BCL-3 (0) 

ADOL(NR) 

2 

T 

22 

CL23 

GTP(NR) 

3 

T 

21 

FMS  (0) 

PFK(NR) 

2 

T 

20 

CL22 

GAPDH(NR) 

4 

. 

19 

BCR(O) 

CL21 

3 

4292.13 

T 

18 

CL25 

CL29 

8 

1276.52 

T 

17 

CL24 

LCK(O) 

3 

1145.90 

T 

16 

PIMl  (0) 

K-RAS2 (0) 

2 

1123.43 

T 

15 

CL20 

GCK(NR) 

5 

1048.10 

T 

14 

TPI (NR) 

SDHB(NR) 

2 

1072.33 

T 

13 

CL27 

HKII (NR) 

3 

1078.99 

12 

CL19 

KIT(O) 

4 

1000.43 

4.00 

11 

CL18 

MAX(O) 

9 

929.47 

6.27 

10 

CL17 

G6PD(NR) 

4 

870.39 

6.25 

9 

CL15 

CL14 

7 

737.83 

15.88 

8 

CLll 

CL16 

11 

558.26 

18.84 

7 

CL26 

CL13 

5 

495.14 

38.40 

6 

CLIO 

CL9 

11 

353.38 

25.81 

5 

CL12 

CL7 

9 

254.75 

26.89 

4 

CL6 

CL8 

22 

105.60 

95.14 

3 

CL5 

RB(TS) 

10 

138.98 

7.36 

2 

CL4 

CL3 

32 

39.20 

105.99 

1 

CL2 

NFl(TS) 

33 

39.20 
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Ties  are  generally  not  desirable  because  the  level  in  the  cluster  hierarchy  where  the  tie 
occurs  and  sometimes  subsequent  levels  are  not  unique.  But,  the  magnitude  of  their 
impact  depends  upon  where  in  the  clustering  hierarchy  they  occur;  ties  early  in  the  cluster 
history  generally  have  little  effect  on  the  later  stages,  ties  in  the  middle  may  signal  the 
need  for  further  investigation,  and  ties  late  in  the  data  set  are  very  important.  They  can 
signal  indeterminacies  within  the  data  (SAS,  1988).  In  the  case  of  these  data,  the  ties 
occur  early  enough  in  the  clustering  process  that  they  do  not  pose  a  problem  for  the 
analysis. 

The  pseudo  F  and  pseudo  results  from  the  other  eight  clustering  methods  are 
examined  and  reported  in  table  4.4.  The  modal  number  of  clusters,  3,  is  taken  as  optimal 
in  this  analysis. 


Table  4.4:  Optimal  Number  of  Clusters 
Based  on  Number  of  Introns  in  33  Genes 


Method 

Pseudo  F 

Pseudo  t^ 

Ward's 

None 

4  or  9 

Beta-Flexible 

None 

4,7,9,12 

Average 

3,13 

3,5,6,10 

Centroid 

3,13 

3,5,6,10 

Complete 

3,13 

3,8,11 

EML 

None 

4,6,9,12 

McQuitty’s 

3,13 

3,7,11 

Median 

3,13 

3,7,11 

Single 

2,4,8,10,12 

2,4,8,10,12 

A  disadvantage  of  SAS  is  its  inability  to  graphically  display  the  clustering  results. 
PROC  TREE  is  difficult  to  format  and  interpret;  it  does  not  produce  the  dendrograms 
(two-dimensional  inverted  tree  structures  that  illustrate  fusions  at  each  successive  level  in 
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the  hierarchy)  commonly  associated  with  cluster  analysis.  To  overcome  this  limitation, 
PowerPoint  Version  4.0  was  used  to  depict  the  clustering  results. 

Clusterings  based  on  Ward’s,  beta-flexible  and  average  linkage  are  shown  in  figures 
4.1, 4.2,  and  4.3  respectively.  Each  method  places  the  same  genes  in  the  three  clusters. 
Thus,  there  is  100  percent  consensus  as  to  cluster  members  based  on  the  three  methods 
for  the  3  cluster  model. 
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Cluster  3 
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Figure  4.1:  Clusters  Based  on  Number  of  Introns  Using  Ward’s  Minimum  Variance  Cluster  Analysis 


Cluster  3 
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Figure  4.2:  Clusters  Based  on  Number  of  Introns  Using  Beta-Flexible  Cluster  Analysis 


Cluster  3 
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Figure  4.3;  Clusters  Based  on  Number  of  Introns  Using  Average  Linkage  Cluster  Analysis 


Interpretation  of  Clustering  Based  on  Number  of  Introns  in  a  Gene 

Observing  the  three  clusters  created  by  each  method  depicted  in  figures  4.1  -  4.3, 
cluster  1  contains  genes  with  1  to  12  introns.  Cluster  2  contains  genes  with  15  to  26 
introns.  Cluster  3  contains  one  gene  with  56  intronic  regions.  Within  cluster  1,  there  are 
two  well  defined  partitions.  The  first  partition  consists  of  almost  all  oncogenes  (tumor 
suppressor  MTSl  is  the  exception).  Genes  in  this  partition  contain  1  to  5  introns.  Eight  of 
the  ten  non-regulatory  genes  fall  into  the  second  partition;  these  genes  contain  from  6  to 
12  introns. 

Within  cluster  2,  there  is  more  of  a  gene  mixture  with  two  distinct  partitions.  The 
number  of  introns  in  genes  from  the  first  partition  range  from  15  to  18.  This  group  of 
genes  contains  two  tumor  suppressor  genes,  two  oncogenes  and  one  non-regulatory  gene. 
The  genes  in  the  second  partition  contain  primarily  21  to  22  ihtrons.  The  retinoblastoma 
{RB)  gene  has  a  greater  number  of  intronic  regions  (26)  than  other  cluster  members  in  this 
partition.  When  observing  the  average  linkage  method,  RB  is  slightly  removed  from  the 
two  partitions  within  cluster  2;  the  partitions  merge  before  joins  the  cluster.  This 
degree  of  differentiation  is  not  apparent  in  the  other  two  clustering  methods.  However, 

RB  is  not  different  enough  from  other  members  of  cluster  2  that  it  instead  joins  cluster  3. 
This  signifies  that  tumor  suppressor  peripheral  neurofibromatosis  (NFl)  is  different 
enough  from  the  other  genes  (i.e.,  it  has  56  introns)  that  it  forms  its  own  cluster.  In  fact, 
all  other  genes  in  the  data  set  fuse  into  a  cluster  before  NFl  joins  the  cluster  at  the  last 
level  in  the  hierarchy. 

A  partition  that  includes  8  of  the  10  non-regulatory  genes  appears  non-random.  This 
suggests  that  number  of  intronic  regions  may  be  somewhat  different  between  non- 
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regulatory  and  regulatory  genes.  A  partition  where  10  out  of  1 1  genes  are  oncogenes  also 
appears  non-random. 

Focusing  on  relationships  within  this  oncogene  partition,  it  is  interesting  to  note  that 
the  HST-1  and  INT-2  genes  cluster  together.  Both  are  members  of  the  fibroblast  growth 
factor  family  and  are  considered  related  based  on  their  protein  function  (Hesketh,  1995). 
The  MYC  genes  join  this  partition  along  with  the  MAX  gene.  The  human  MYC  genes  and 
the  gene  are  also  considered  to  be  closely  related  in  terms  of  protein  function 
(Hesketh,  1995).  The  MYC,  MYC-N  and  MYC-L  proteins  are  helix-loop-helix/leucine 
zipper  proteins  that  form  sequence-specific  DNA  binding  heterodimers  with  the  MAX 
protein  (Hesketh,  1995).  Although  all  of  these  genes  are  not  members  of  the  same 
family,  they  all  have  some  level  of  involvement  as  related  to  fibroblast  (spindle-shaped 
cells  responsible  for  the  formation  of  extracellular  fibers  such  as  collagen  in  connective 
tissues)  growth  or  transformation  (King  and  Stansfield,  1990,  Hesketh,  1995). 

Even  genes  in  this  partition  that  are  farther  apart  seem  to  have  closer  functional 
relationships  to  the  genes  within  the  partition  than  to  other  genes  in  the  data  set.  For 
example,  KRAS2  does  not  transform  normal  fibroblasts  but  transforms  transfected  (cells 
that  have  acquired  foreign  DNA)  fibroblasts  when  supplemented  with  oncogenes  like 
MYC.  PIM-1  also  has  this  relationship  with  MYC.  WNT-1  is  known  to  transform 
fibroblasts,  but  with  low  efficiency.  However,  it  is  known  to  play  a  role  in  the  fibroblast 
transformation  process  through  a  paracrine  mechanism.  Overexpression  of  FOS  does  not 
transform  human  fibroblasts,  but  rat  FOS  does  transform  rat  fibroblasts  under  certain 
conditions  .  There  is  no  currently  known  relationship  between  WNT-I  and  FOS 


(Hesketh,  1995). 
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Ward’s  and  beta-flexible  both  recommended  four  clusters  rather  than  three  clusters. 
The  preponderance  of  evidence  from  the  other  methods  did  not  support  four  clusters,  but 
the  possibility  of  four  clusters  should  not  be  completely  eliminated  from  consideration. 
Four  clusters  would  formally  break  cluster  1  into  the  two  partitions  discussed  above.  It 
is,  however,  equally  possible  that  Ward’s  is  showing  a  bias  towards  clusters  of  equal  size. 
Another  item  of  interest  is  that  single  linkage  cluster  analysis  seems  to  digress  from  the 
other  methods.  This  result  supports  Milligan’s  (1995)  observations  concerning  single 
linkage  clustering  based  on  his  simulation  studies;  that  single  linkage  consistently 
performs  poorly. 

Regardless  of  slight  variation  in  the  optimal  number  of  clusters,  the  biological 
information  of  interest  for  this  variable  remains  unchanged.  There  appears  to  be  a 
relationship  between  number  of  introns  in  a  gene  and  the  gene’s  regulatory  status. 
Clustering  Based  on  Number  of  Introns  and  Mean  Intronic  Log  Lengths  in  a  Gene 

The  clustering  process  based  on  both  the  number  of  introns  in  a  gene  and  the  mean 
intronic  log  lengths  in  a  gene  is  evaluated  for  the  nine  clustering  methods  offered  by 
SAS.  Ties  did  not  occur  in  the  SAS  output  for  this  combination  of  variables.  The  pseudo 
F  and  pseudo  results  are  reported  in  table  4.5.  There  is  a  strong  consensus  that  10 
clusters  is  optimal. 


52 


Table  4.5:  Optimal  Number  of  Clusters 
Based  on  Number  of  Introns  and  Means 
of  Intronic  Log  Lengths  for  33  Genes 


Method 

Pseudo  F 

Pseudo  t‘‘ 

Ward's 

10 

2,4,8,10,12 

Beta-Flexible 

5,10 

4,8,10,14 

Average 

5,10 

3,4,7,13 

Centroid 

5,9,11 

3,4,7,11,12 

Complete 

10 

4,7,10,12 

EML 

10 

3,4,7,13 

McQuitty's 

5,10 

4,7,10,13 

Median 

3,5,8 

3,6,11,13 

Single 

8 

2,8,11 

Clusterings  based  on  Ward’s,  beta-flexible  and  average  linkage  are  shown  in  figures 
4.4, 4.5,  and  4.6  respectively.  Each  method  places  the  same  genes  in  the  ten  clusters. 
Thus,  there  is  100  percent  consensus  as  to  cluster  members  based  on  the  three  preferred 
methods. 


53 


Cluster  8 


Figure  4.4:  Clusters  Based  on  Number  of  Introns  and  Means  of  Intronic  Log  Lengths  Using  Ward’s  Minimum  Variance 
Cluster  Analysis 


Cluster  1 


Figure  4.5:  Clusters  Based  on  Number  of  Introns  and  Means  of  Intronic  Log  Lengths  LFsing  Beta-Flexible 
Cluster  Analysis 


Cluster  4  Cluster  8 


Figure  4.6:  Clusters  Based  on  Number  of  Introns  and  Means  of  Intronic  Log  Lengths  Using  Average  Linkage  Cluster 
Analysis 


Interpretation  of  Clustering 

The  three  MYC  genes  cluster  together.  MYC  and  N-MYC  cluster  very  early  in  the 
process,  with  Z-A/7C  joining  later.  The  early  clustering  ofHST-1  and  WNT-1  is  notable 
since  HST-1  is  known  to  be  a  fibroblast  growth  factor  and  WNT-1  is  also  believed  to  be  a 
growth  factor  with  an  unknown  receptor.  Proteins  from  both  of  these  genes  may  have  a 
paracrine  mechanism.  However,  lNT-2  is  also  a  member  of  the  fibroblast  growth  factor 
family  and  it  does  not  cluster  with  these  two.  It  instead  clusters  with  K-ras  later  in  the 
clustering  hierarchy.  This  may  be  biologically  significant  because  both  INT-2  and  K-ras 
play  a  role  in  the  transformation  of  NIH  3T3  cells.  NIH  3T3  cells  transformed  by  mouse 
INT-2  cDNA  express  a  series  of  INT-2-re\ated  proteins;  no  mention  was  made  as  to 
whether  this  is  also  true  in  humans  (Hesketh,  1995).  Normal  fibroblasts  are  not 
transformed  by  RAS  oncogenes,  but  NIH  3T3  fibroblasts  are  transformed  by 
overexpression  of  normal  RAS  proteins.  Thus,  it  is  biologically  plausible  that  these  two 
genes  may  share  a  more  distant  relationship. 

The  early  clustering  of  KIT  and  FMS  is  also  of  interest,  especially  since  these  have 
previously  been  identified  as  being  structurally  similar  (Hesketh,  1995).  RAFAl  is  a 
member  of  the  SRC  super-family  of  protein  kinases.  FPS  is  known  to  have  C-terminal 
homology  with  SRC  (Hesketh,  1995).  LCK  is  also  a  member  of  the  SRC  family  and  joins 
with  FPS  and  RAFAl  later  in  the  clustering  process.  These  two  are  not  as  closely 
clustered  as  some  of  the  other  gene  relationships,  but  it  may  be  that  they  share  a  more 
distant  relationship.  There  is  also  a  small  cluster  of  only  non-regulatory  genes,  and 
several  small  clusters  containing  almost  exclusively  oncogenes. 
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Conversely,  oncogene  LCK  and  non-regulatory  gene  G6PD  cluster  closely  for  no 
apparent  biological  reason.  The  same  is  true  of  MAX  and  MTSl.  Although  AfTSl  is  a 


multiple  hunor  suppressor  and  could  therefore  influence  many  genes,  there  is  not 
apparent  reason  why  it  would  cluster  more  closely  with  MAX  (which  involves  NIH  3T3 
fibroblasts,  HeLa  cells,  neuroblastoma-derived  cell  lines)  than  other  oncogenes.  This  is 
additionally  complicated  by  the  fact  that  MAX  is  known  to  be  related  to  the  MYC  family 
and  yet  it  did  not  cluster  with  this  family.  MAX  did,  however,  cluster  with  WNT-1 , 
HST-1,  FOS  and  BCL-3.  With  the  exception  of  BCL-2,  all  of  these  have  relationships 
that  have  previously  been  discussed.  Little  is  known  about  BCL-3,  so  it  is  difficult  to 
determine  where  this  gene  belongs. 

In  conclusion,  these  clusters  suggest  that  there  may  be  a  relationship  based  on  number 
of  introns  and  intronic  log  lengths  of  the  introns.  There  are  some  genes  that  are  not  close 
when  observing  clusters  based  on  the  two  variables  separately,  but  seem  to  move  closer 
when  these  two  variables  combine.  It  is  also  interesting  that  the  optimal  number  of 
clusters  indicates  more,  smaller  groups. 

It  may  be  that  we  would  see  more  (or  fewer)  relationships  if  more  data  were  available. 
Meanwhile,  the  clustering  provides  at  least  partial  evidence  of  non-randomness. 
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CHAPTER V 


TEST  FOR  INDEPENDENCE  VERSUS  A  FIRST-ORDER  MARKOV  PROCESS 
Methodology 

DNA  can  be  considered  as  a  stochastic  process  Z,  where  each  observation 
assumes  one  of  four  states:  A,  G,  C,  or  T  corresponding  to  the  four  nucleotides.  liX^  -  i, 
the  process  is  said  to  be  in  state  i  at  time  m.  Whenever  the  process  is  in  state  i,  there  is  a 
fixed  probability  py  that  the  process  will  next  be  in  state  j.  This  is  expressed  as 

P{X,„^,  =  j\X,„  =  i}  =  py  (Ross,  1989) 

for  all  states  i,  j  and  all  m  >  0 .  In  this  situation,  the  DNA  sequence  may  be  described  by 
a  first-order  Markov  chain  model.  In  words,  the  above  equation  states  that,  for  a  Markov 
chain,  the  conditional  distribution  of  any  state  given  the  past  states 
Xq,  Aj  ,  . . . ,  X^_^ ,  and  the  present  state  X^,  is  independent  of  the  past  states  and  depends 
only  upon  the  present  state,  py  is  one-step  transition  matrix  and  the  Markov  chain  it 
represents  is  said  to  be  of  first-order.  Properties  of  py  are 

s 

Py  >  0  where  i  and y  are  >  0 ,  and  ^  Py  =  1 ,  where  z  =  1 , 2, . . . ,  (Ross,  1 989) 

7=1 

where  s  is  the  number  of  states.  A  Markov  chain  is  said  to  have  order  zero  if  it  is  a 
sequence  of  independent  random  variables,  i.e.,  py  does  not  depend  upon  z. 

The  idea  that  a  DNA  sequence  (especially  the  intronic  regions)  might  be  represented 
by  a  Markov  chain  has  been  discussed  in  the  literature  without  resolution.  It  seems 
appropriate  to  use  a  statistical  hypothesis  testing  methodology  to  investigate  this  issue. 


Billingsley  (1961)  provides  a  Chi-square  test  for  a  null  hypothesis  that  the  data  are 
independent  (i.e.,  a  zero-order  Markov  process)  against  the  alternative  hypothesis  that  the 
data  come  from  a  first-order  Markov  process.  This  test  will  be  applied  to  the  data  in  this 
thesis  to  investigate  whether  introns  within  a  gene  show  evidence  of  independence  or  a 
first-order  Markov  process.  An  examination  of  introns  within  a  gene  has  not  previously 
been  investigated  in  the  literature  and  it  seems  appropriate  to  test  for  evidence  of  a  first- 
order  Markov  process  before  performing  the  test  of  homology  in  Chapter  VI. 

To  test  whether  an  observed  sequence  is  really  an  independent  sequence,  let  { x, 
x„  j  be  a  sample  from  a  first-order  Markov  process  with  transition  probabilities  py  The  s  x 
s  matrix  F=  {^j}  provides  the  transition  count  of  the  sequence,  i.e.,  the  number  of 
transitions  from  state  i  to  state  j.  The  hypotheses  are 

Hq  :  Pij  =  Pj  is  independent  of  i  for  all  i  versus  :  py  is  not  independent  of  i  for  some  i. 
The  probability  of  obtaining  a  particular  sequence  begins  at  x,  and  has  a  transition  count 
matrix  F.  Let 

fi.=Y.jfv  and 4- 

{fjis  the  number  of  transitions  out  of  state  i  (where  f  =/,  =/J)  and  {fj}  is  the  number 
of  transitions  into  state  j.  This  means 


60 


The  following  statistic  can  now  be  utilized 


(Billingsley,  1961) 


r  fj,/(n-l) 

which  is  chi-square  with  s(s-l)-(s-l)  =  (s-lf  degrees  of  freedom  for  a  first-order  Markov 
process. 

It  can  be  shown  that  this  chi-square  statistic  is  approximately  a  Neyman-Pearson 
likelihood  ratio  test 


y  (•^i'  ^  2  y  f  In 


(Billingsley,  1961) 


In  the  limit,  each  formula  has  a  Chi-square  distribution  with  (s-lf  degrees  of  freedom. 
Thus,  the  Chi-square  test  for  zero-order  versus  first-order  in  a  Markov  process  is 
asymptotically  equivalent  to  performing  a  likelihood  ratio  test  based  on  multinomial 
distributions. 

Application 

As  an  example,  the  first-order  transition  counts  for  the  first  intron  of  BCL-2>  is  shown 
in  table  5.1.  The  length  of  the  first  intron  of  BCL-3  is  1355  bp.  Matrix  F  is  provided  as 
follows: 


Table  5.1:  First-order  Transition  Count 
for  Intron  1  of  BCL-3 


m 

A 

G 

C 

T 

A 

76 

143 

61 

73 

353 

G 

127 

194 

54 

70 

445 

C 

105 

13 

100 

73 

291 

T 

45 

353 

95 

445 

76 

291 

49 

265 

265 

1354 

61 


The  Chi-square  test  statistic  is 


175.51, 


which  is  much  larger  than  xl  =  21.666  (where  a  =  0.01  and  the  degrees  of  freedom  are 

computed  from  (s-lf  =  (4-1)^  =  9).  A  small  a  was  chosen  to  provide  a  stringent 
hypothesis  test.  Thus,  the  null  hypothesis  of  independence  is  rejected  and  it  is  concluded 
that  the  intron  1  of  BCL-3  sequence  more  consistent  with  a  first-order  Markov  process 
than  with  independence. 

An  F  transition  matrix  is  created  for  all  375  introns  in  the  33  genes.  Not  all  intronic 
information  is  available  for  every  intron  in  every  gene.  Unfortunately,  with  intronic 
regions,  the  actual  location  of  the  missing  data  is  often  unknown.  Table  B.l  depicts  the 
amount  of  intronic  sequence  data  available.  The  S-Plus  code  is  provided  in  appendix  E. 

Table  5.2  depicts  the  results  for  the  375  introns  in  the  data  set.  White  cells  represent 
independent  intronic  regions.  Light  gray  cells  represent  intronic  sequences  showing 
evidence  of  a  first-order  Markov  process. 

Sixty-seven  percent  of  the  introns  show  evidence  of  a  first-order  Markov  process.  It 
is  very  interesting  to  note  that  a  given  gene  may  have  varying  degrees  of  independence. 
This  may  explain  why,  in  the  biological  literature,  there  are  mixed  results  as  to  whether 
there  is  evidence  of  a  Markov  process  in  sequences.  From  only  a  random  sample,  it  is 
possible  to  obtain  mixed  results. 
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Table  5.2:  Results  From  Test  of  Independence  Versus  Markov  Structure  For  Introns  in  Each  Gene 
(Gray  Cells  Reject  at  the  a  =  .01  Level) _ 
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Table  5.2  (Continued):  Results  for  Test  of  Independence  Versus  Markov  Structure  For  Introns  in  Each  Gene 
(Gray  Cells  Reject  at  the  a  =  .01  Level)  _ 
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Table  5.2(Coiitinued):  Results  For  Test  of  Independence  Versus  Markov  Structure  For  Introns  in  Each  Gene 
(Gray  Cells  Reject  at  the  a  =  .01  Level)  _ 
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GAPPH(NR) 

SPHB(NR) 


Table  5.3  displays  the  percentage  of  intronic  regions  that  show  evidence  of  a  first-order 
Markov  process  by  gene. 


Table  5.3  Percentage  of 
Intronic  Regions  Showing 
Evidence  of  a  Markov 
Process 


Gene 

Percent  (%) 

ABL(O) 

100 

BCR(0) 

100 

FMS(O) 

90.5 

BCL-3(0) 

62.5 

FOS(O) 

66.7 

HST-l(O) 

100 

INT-2(0) 

100 

LCK(O) 

45.5 

MYC(O) 

100 

L-MYC(O) 

50 

N-MYC(O) 

100 

MAX(O) 

100 

PIMl(O) 

60 

KIT(O) 

45 

RAF  A  1(0) 

80.0 

FPS(O) 

83.3 

WNT-l(O) 

100.0 

K-RAS2(0) 

100.0 

MLHl(TS) 

66.7 

MSH2(TS) 

40.0 

MTSl(TS) 

100.0 

RB(TS) 

46.2 

NFl(TS) 

50.0 

G6PD(NR) 

100.0 

PGK(NR) 

70.0 

ADOL(NR) 

100.0 

GTP(NR) 

62.5 

PFK(NR) 

38.1 

TPI(NR) 

83.3 

GCK(NR) 

88.9 

GAPDH(NR) 

50.0 

SDHB(NR) 

85.7 

HKII(NR) 

58.8 
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Two  additional  issues  of  interest  are:  (1)  the  power  to  detect  statistically  significant 
differences,  and  (2)  multiple  testing.  To  investigate  whether  all  the  independent 
sequences  were  only  the  shortest  ones,  I  superimposed  the  gray  cells  over  the  intronic 
lengths.  This  is  depicted  in  table  B.l .  There  are  some  short  sequences  that  are 
independent,  but  there  are  also  some  very  long  sequences  that  are  clearly  independent. 
Similarly  there  are  some  fairly  short  sequences  that  show  evidence  of  a  Markov  process. 
Although  there  is  a  small  sample  power  issue  here,  it  does  not  seem  to  be  a  severe 
problem. 

For  the  multiple  testing  problem,  when  testing  at  the  1  percent  level,  one  expects  by 
chance  alone  to  have  1  percent  of  the  cells  be  significant.  Clearly,  there  are  many  more 
cells  that  reject  the  null  hypothesis.  There  really  was  no  way  to  avoid  this  problem. 

Since  the  intronic  regions  are  processed  individually  in  the  body,  it  would  not  be 
appropriate  to  link  them  together  (as  is  often  done  with  the  exonic  regions).  This  means 
that  it  is  not  possible  to  do  a  hypothesis  test  of  the  whole  DNA  sequence  before  testing 
the  individual  intronic  subsequences.  This  is  the  approach  recommended  in  Zerbe  and 
Murphy  (1986),  but  it  cannot  be  incorporated  herein.  Thus,  it  must  be  acknowledged  that 
about  intronic  regions  may  show  a  first-order  Markov  process  simply  by  chance. 

Now,  having  sufficient  evidence  of  Markov  process  for  a  majority  of  introns,  I  can 
proceed  to  test  the  homology  (similarity)  of  the  introns. 
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CHAPTER  VI 


The  asymptotic  distribution  has  s(s-l)  degrees  of  freedom  as  oo ,  where  s  (the  number 
of  states)  is  fixed.  This  formula  can  also  be  extended  to  n  samples.  In  this  case,  the 
degrees  of  freedom  is  (n-l)s(s-l). 

Combining  Chi-square  Test  Statistics 

Having  computed  the  Chi-square  test  statistics  for  all  corresponding  intron  pairs 
between  two  genes,  the  mean  of  the  Chi-square  test  statistics  is  computed.  This  provides 
a  global  relationship  for  the  intronic  regions  between  two  genes.  This  mean  becomes  a 
measure  of  distance  that  is  used  in  a  distance  matrix  to  perform  a  cluster  analysis.  This 
distance  matrix  is  taken  directly  from  the  relationships  in  the  data  and  is  an  example  of  a 
one-mode  structure.  There  is  no  evidence  in  the  literature  of  a  test  statistic  being  used  as 
a  distance  measure.  This  type  of  distance  measure  may  provide  informative  clustering 
information  than  using  features  information. 

Billingsley’s  homology  statistic  is  applied  to  the  corresponding  intronic  regions  of 
DNA  from  each  gene  pair.  Corresponding  intronic  regions  were  chosen  because,  in  vivo, 
the  introns  are  transcribed  in  order  in  the  nucleus.  It  is  appropriate  to  compare  intronic 
region  1  from  gene  1  with  intronic  region  1  from  gene  2,  and  then  intronic  region  2  from 
gene  1  with  intronic  region  2  from  gene  2,  etc.,  until  all  paired  intronic  regions  have  been 
completed  out  to  the  number  of  introns  in  the  gene  which  has  the  fewest. 

This  approach  differs  from  the  usual  scores  based  on  percent  match  found  in  DNA 
analysis.  All  the  available  intronic  information  is  used  to  compute  the  transition  matrices 
for  each  region.  There  is  no  truncation  of  the  DNA  data  from  an  intronic  region.  We  are 
observing  ordered  trend  for  all  the  data  in  the  DNA  sequences  rather  than  alignment. 

This  seems  appropriate  in  light  of  the  fact  that  intronic  regions  are  knotvn  to  be  less 
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conserved  and  have  more  gaps.  The  ordered  relationships  will  generally  not  be  as 
apparent  as  those  found  in  the  coding  regions.  This  is  a  different  way  to  determine 
statistical  homology. 

The  advantage  here  is  that  we  have  an  actual  hypothesis  testing  methodology  which  is 
lacking  in  the  current  sequence  analysis  methodologies.  The  hypotheses  are 

Hq:  py  =  qy,  for  all  ij  (i.e.,  the  two  intronic  regions  are  homologous) 

versus 

Hp  Py  *  qy ,  for  some  i,j 

where  py  and  qy  are  the  transition  probability  matrices  for  each  pair  of  intronic  regions 
from  each  gene  pair. 

Application 

As  an  example,  suppose  intron  1  from  5CL-3  is  again  chosen.  The  transition  count  is 
provided  below  in  table  6. 1 .  Intron  1  from  BCL-7)  is  compared  with  intron  1  from 
FOS.  The  transition  counts  for  FOS  are  given  in  table  6.2. 


Table  6.1:  First-Order  Transition 
Count  for  BCL-3,  Intron  1 


A(l) 

G(2) 

C(3) 

T(4) 

A(l) 

76 

143 

61 

73 

G(2) 

127 

194 

54 

70 

C(3) 

105 

13 

100 

73 

T(4) 

45 

95 

76 

49 

Table  6.2:  First-  Order  Transition 
Count  for  FOS,  Intron  1 


ran 

^1 

mm 

Bill 

34 

60 

31 

24 

G(2) 

58 

100 

69 

33 

C(3) 

41 

50 

51 

54 

T(4) 

16 

50 

45 

36 

Since  the  corresponding  transition  count  matrices  have  dimension  4  x  4,  the  degrees  of 
freedom  will  be  s(s-l)  =  4(4-1)  =  12.  When  a  =  0.01,  the  critical  point  is  26.217. 


2  V'  fiSi  f  f'J 
X  “2-1 ‘Z"; - 7  — 


=  82. 1 5092  for  BCL-3  and  FOS  introns  1 . 


The  null  hypothesis  is  rejected;  these  two  introns  are  not  homologous. 

This  hypothesis  test  is  performed  for  all  possible  corresponding  intron  pairs.  The  S- 
Plus  function  to  compute  the  Chi-square  test  statistic  is  provided  in  appendix  E.  Results, 
by  gene,  are  provided  in  tables  C.l  -  C.33;  white  cells  denote  statistical  homology  while 
gray  cells  indicate  a  lack  of  homology. 

Some  interesting  results  can  be  noted  in  these  tables.  Two  genes  may  have  some 
intronic  regions  which  are  statistically  different  while  others  show  evidence  of  homology. 
Numerous  gene  pairs  are  completely  non-homologous,  while  no  two  genes  show 
evidence  of  complete  homology. 

Of  even  greater  interest  is  the  fact  that  many  of  the  oncogenes  appear  to  have  more 
similarity  with  several  non-regulatory  genes  than  other  oncogenes.  This  will  be  further 
examined  after  the  clustering  process. 


The  clustering  process  described  in  Chapter  IV  is  performed,  with  several  subtle 
differences.  Since  I  provide  my  own  distance  matrix  which  is  already  squared,  the 
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nosquare  option  in  SAS  is  chosen.  The  SAS  code  for  this  analysis  is  provided  in  appendix 
D.  M^en  one  uses  a  one-mode  structure,  the  ability  to  use  the  EML  clustering  process  is 
lost.  This  is  because  EML  relies  on  the  number  of  variables  (i.e.,  coordinate  data)  for  its 
calculation.  This  information  is  not  available  when  a  unique  distance  matrix  is 
incorporated.  For  the  same  reason,  the  ability  to  compute  the  pseudo  F  and  pseudo  is 

lost  for  all  methods  except  Ward’s,  average  linkage,  and  centroid  clustering.  Results 
from  the  pseudo  F  and  pseudo  for  these  methods  are  shown  in  table  6.3.  This  means 
that  there  is  less  information  from  which  to  draw  a  consensus  as  to  the  optimal  number  of 
clusters.  From  the  information  remaining,  there  appears  to  be  a  consensus  that  the 
optimal  number  of  clusters  is  2. 


Table  6.3:  Optimal  Number  of  Clusters 
Based  on  Mean  Chi-square  Homology 
Statistics 


Method 

Pseudo  F 

Pseudo  C 

Ward's 

2 

2,3,8 

Flexible-Beta 

NA 

NA 

Average 

2,7,14 

2,7,10,12 

Centroid 

2,4,6,9,14 

2,5,12,14 

Complete 

NA 

NA 

EML 

McQuitty's 

NA 

NA 

Median 

NA 

NA 

Single 

NA 

NA 

Clusters  based  on  Ward’s,  flexible-beta  and  average  linkage  are  shown  in  figures  6.1, 
6.2,  and  6.3  respectively.  As  there  is  not  100  percent  consensus  between  methods  as  to 
cluster  members,  table  6.4  summarizes  the  cluster  members  by  method. 
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Cluster  1 


Figure  6.1:  Clusters  Based  on  Homology  Chi-Square  Test  Results  Using  Ward’s  Minimum  Variance  Cluster  Analysis 


Cluster  2 


Figure  6.2:  Clusters  Based  on  Homology  Chi-Square  Test  Results  Using  Beta-Flexible  Cluster  Analysis 


Clusters  Based  on  Homology  Chi-Square  Test  Results  Using  Average  Linkage  Cluster  Analysis 


Table  6.4:  Comparison  of  Cluster  Members  for 
Preferred  Methods  Based  on  Mean  Chi-Square 
Homology  Results  After  Comparing  Introns 


Between  Genes 


Cluster 

Ward's 

Average 

Beta- 

Flexible 

LCK(O) 

LCK(O) 

LCK(O) 

WNT»1(0) 

WNT-l(O) 

WNT-l(O) 

MAX(O) 

MAX(O) 

MAX(O) 

RAFAl(O) 

RAFAl(O) 

RAFAl(O) 

BCR(O) 

BCR(O) 

BCR(O) 

FMS(O) 

FMS(O) 

FMS(O) 

HST-l(O) 

HST-l(O) 

HST-l(O) 

PIMl(O) 

PIMl(O) 

PIMl(O) 

FPS(O) 

FPS(O) 

FPS(O) 

BCL-3(0) 

BCL-3(0) 

BCL-3(0) 

FOS(O) 

FOS(O) 

FOS(O) 

Cluster  1 

N-MYC(0) 

N-MYC(O) 

N-MYC(O) 

MYC(O) 

MYC(0) 

MYC(0) 

L.MYC(O) 

L-MYC(O) 

L-MYC(O) 

KIT(O) 

KIT(O) 

KIT(O) 

ABL(O) 

ABL(O) 

K-RAS2(0) 

K-RAS2(0) 

INT-2(0) 

NFl(TS) 

NFl(TS) 

MSH2(TS) 

MSH2(TS) 

MLHl(TS) 

MLHl(TS) 

MTS  1  (IS) 

MTSl(TS) 

MTSl(TS) 

G6PD(NR) 

G6PD{NR) 

G6PD(NR) 

GCK(NR) 

GCK(NR) 

GCK{NR) 

PFK(NR) 

PFK(NR) 

PFK{NR) 

HKII(NR) 

HKII(NR) 

HKII(NR) 

TPI(NR) 

TPI(NR) 

TPI(NR) 

GAPDH(NR) 

GAPDH(NR) 

GAPDH(NR) 

GTP(NR) 

GTP(NR) 

SDHB(NR) 

SDHB(NR) 

PGK(NR) 

PGK(NR) 

ADOL(NR) 

ADOL(NR) 

INT-2(0) 

INT-2(0) 

ABL(O) 

K-RAS2(0) 

RB(TS) 

NFl(TS) 

Cluster  2 

MSH2(TS) 

MLHl(TS) 

GTP(NR) 

SDHB(NR) 

PGK(NR) 

ADOL(NR) 

Interpretation  of  Clustering 

The  beta-flexible  method  and  the  average  linkage  cluster  analysis  provide  similar 
clusterings.  Ward’s  minimum  variance  cluster  analysis  provides  slightly  different 
clusters  from  these  two  methods.  A  possible  explanation  for  this  difference  is  the  bias 
Ward’s  method  has  demonstrated  in  simulation  studies  toward  clusters  of  equal  size.  For 
the  other  two  methods,  cluster  2  only  has  one  gene,  while  the  cluster  2  based  on  Ward’s 
method  has  ten  genes.  Although  there  is  not  complete  consensus  on  cluster  members, 
there  are  some  very  informative  results.  Note  that  four  out  of  five  tumor  suppressor 
genes  form  a  partition  within  one  of  the  clusters.  Despite  the  lack  of  consensus 
concerning  the  other  partition  with  which  the  tumor  suppressors  eventually  join,  they 
remain  together  for  all  clustering  methods.  It  is  also  of  interest  that  the  MYC  genes  fuse 
into  the  same  partition,  although  other  genes  fuse  with  them. 

There  are  some  non-regulatory  genes  that  group  closely  with  oncogenes.  Other  non- 
regulatory  genes  combine  to  form  an  almost  totally  non-regulatory  partition  within  cluster 
2  (only  oncogene  ABL  joins  them).  It  is  possible  that  the  inclusion  of  ABL  is  due  to  a 
totally  random  occurrence.  However,  it  is  also  possible  that  this  result  might  be 
biologically  meaningful.  The  tables  in  appendix  C  provide  many  examples  of  oncogenes 
which  are  consistently  more  similar  to  a  certain  non-regulatory  gene  than  to  the  other 
oncogenes.  The  possible  biological  meaning  behind  the  relationship  (if  any)  between 
these  two  gene  types  is  unknown. 
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CHAPTER  VII 


CONCLUSIONS 

In  this  chapter,  the  strengths  and  weaknesses  of  analyzing  data  using  cluster  analysis 
are  discussed.  The  biological  implications  of  the  clustering  results  are  also  reviewed. 
Finally,  some  areas  for  further  research  are  suggested. 

Statistical  Conclusions 

When  properly  used,  cluster  analysis  allows  the  determination  of  possible  natural 
groupings  within  data.  However,  cluster  analysis  is  often  improperly  applied,  usually 
because  conclusions  are  based  on  only  one  method  (often  that  which  best  represents  a 
desired  conclusion).  Simulation  studies  can  provide  recommendations  concerning  the 
optimal  clustering  method  for  various  cluster  sizes  and  shapes.  But  these  simulation 
studies  assume  the  user  knows  the  sizes  and  shapes  of  the  clusters  in  advance  and  this 
information  is  rarely  obtainable. 

A  partial  solution  is  to  look  for  consensus  among  several  methods.  If  they  all  concur, 
there  is  a  greater  possibility  that  a  natural  ordering  occurs  within  the  data.  However,  this 
involves  a  considerable  time  investment  to  analyze  all  possible  methods  for  all  possible 
variable  configxirations.  Cluster  analysis  is  not  an  expedient  process.  Additionally,  some 
of  the  better  software  packages  for  this  analysis,  for  example,  SAS,  do  not  have  the 
graphics  capability  to  provide  visualization  of  the  clusters.  This  encourages  the 
uninformed  user  to  use  software  offering  weaker  clustering  methods  but  better  graphics. 


for  example,  S-Plus. 


SAS  provided  statistical  tools  (i.e.,  the  pseudo  F  and  pseudo  t^)  for  determining  the 
optimal  number  of  clusters.  This  was  of  considerable  benefit  because  these  two  tests  had 
performed  admirably  in  Milligan’s  simulation  studies.  But  neither  is  a  perfected  analytic 
technique.  No  critical  value  is  provided  that  conclusively  resolves  the  proper  number  of 
clusters  issue  for  the  researcher.  Unless  pseudo  F  and  pseudo  results  from  numerous 
methods  are  compared  for  consensus,  the  optimal  number  of  clusters  becomes  distorted 
due  to  inconsistent  information.  This  makes  cluster  analysis  appear  as  more  of  an  art 
form  than  a  statistical  tool.  There  are  times  when  it  is  impossible  to  determine  the 
optimal  number  of  clusters. 

This  is  not  to  say  that  cluster  analysis  does  not  have  benefits.  It  provides  the 
opportunity  to  look  within  groups  to  see  the  ordering  of  the  objects.  This  is  not  possible 
with  traditional  statistical  analyses.  In  ANOVA,  MANOVA,  and  hypothesis  testing,  the 
groups  are  predetermined  by  the  statistician.  Inherent  in  these  processes  is  a  controlled 
allocation  of  group  members.  Whether  these  groups  represent  natural  groupings  is 
unresolved. 

Another  benefit  of  cluster  analysis  is  prediction  capability.  Once  ordering  within 
clusters  is  established,  cluster  analysis  may  be  used  to  predict  placement  of  new  objects 
into  their  proper  clusters. 

Most  cluster  analyses  consist  of  calculating  a  Euclidean  distance  matrix  from 
descriptive  features.  Creation  of  a  distance  matrix  based  on  statistical  test  results  may 
bring  more  statistical  knowledge  into  the  clustering  process,  but  the  issue  of  whether  that 
helps  form  more  stable  clusters  if  unknown. 
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Riological  Conclusions 


Clustering  of  tumor  suppressor  genes  and  the  MYC  family  based  on  the  homology 
matrix  is  biologically  significant.  The  clustering  of  oncogenes  and  non-regulatory  genes 
based  on  number  of  introns,  and  to  a  lesser  degree,  results  of  the  means  of  intronic  log 
lengths  combined  with  the  number  of  introns,  also  indicate  the  occurrence  of  non-random 
processes.  This  finding  supports  the  biological  hypothesis  that  introns  may  have  a 
specific  purpose  and  they  may  group  by  degree  of  regulatory  function. 

The  issue  of  the  oncogenes  consistently  clustering  closely  with  non-regulatory  genes 
based  on  the  homology  matrix  is  an  interesting  result.  It  may  indicate  that  there  is  not  a 
substantial  difference  between  these  two  groups.  Hence,  only  the  sequences  of  the  tumor 
suppressor  genes  differ  significantly  from  the  other  two  groups  based  on  sequence 
content.  Another  possibility  is  there  is  a  closer  relationship  between  oncogenes  and  non- 
regulatory  genes  than  was  previously  thought.  A  possible  hypothesis  is  that  the  intronic 
regions  of  the  oncogenes  somehow  influence  (maybe  by  transposons)  the  non-regulatory 
intronic  regions.  Why  the  intronic  regions  of  these  different  gene  groups  are  statistically 
similar  is  of  great  interest.  Although  statistical  significance  is  not  a  guarantee  of 
biological  homology,  it  is  usually  considered  to  be  an  acceptable  indicator  of  the 
existence  of  a  relationship.  A  second  possibility  is  that  there  are  not  yet  enough  genes  in 
the  sample  for  the  distinctions  between  oncogenes  and  non-regulatory  genes  to  be 
manifest. 

The  indication  of  Markov  processes  in  some  introns  and  not  in  others  within  the  same 

gene  is  also  of  biological  importance.  It  is  possible  that  introns  displaying  Markov 

processes  are  more  regulatory,  while  introns  displaying  randomness  are  less  important  to 
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gene  function.  Different  introns  may  have  displayed  different  levels  of  regulation  at 
different  evolutionary  times.  The  results  in  this  thesis  also  explain,  to  some  degree,  the 

C; 

inconsistency  in  the  literature  as  to  whether  Markov  processes  occur  in  DNA  sequences. 
We  observed  67  percent  Markov  processes  and  33  percent  randomness.  Depending  upon 
the  random  sample  taken,  it  is  possible  that  either  outcome  could  occur. 

Areas  of  Future  Interest 

As  more  DNA  data  are  sequenced,  these  results  can  be  re-evaluated.  It  is  possible 
that  the  clusters  will  separate  more  distinctly  by  regulatory  group  as  the  gene  sample  size 
increases.  A  look  at  clustering  based  only  upon  homology  within  the  longer  DNA 
intronic  regions  may  also  be  of  value.  There  are  some  intronic  regions  (e.g.,  intron  1) 
which  are  atypically  long  in  many  of  the  genes.  Since  they  are  noticeably  different  in 
length,  they  warrant  further  investigation.  It  will  be  interesting  to  see  whether  future 
laboratory  studies  and  DNA  analyses  support  our  conclusions. 

Use  of  Billingsley’s  homology  statistic  can  be  extended  to  three  or  more  genes.  This 
indicates  the  statistical  possibility  of  multi-dimensional  clustering.  It  might  be  possible, 
using  the  results  of  this  statistic,  to  create  multi-dimensional  distance  matrices.  How 
these  would  compare  with  the  two  dimensional  ones  might  lend  more  insight  into  the 
methodology  behind  present  clustering  processes. 

The  use  of  the  homology  statistic  can  be  extended  to  higher  order  Markov  processes. 

It  was  not  possible  in  this  thesis  because  there  were  many  introns  that  were  short,  creating 
too  many  zero  cells  in  higher  order  transition  matrices.  This  approach  could  be 
investigated  on  the  longer  introns. 
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Results  from  a  test  of  homology  were  used  as  a  distance  measure.  1  applied  the  same 
hypothesis  test  to  all  comparable  intronic  regions  within  a  gene  pair,  regardless  of 
whether  they  showed  evidence  of  a  zero-order  or  a  first-order  Markov  process.  For  the 
comparison  of  two  intronic  regions  where  both  regions  are  composed  of  independent 
sequences,  one  would  normally  do  a  Chi-square  test  of  independence  without  the 
assumption  of  a  Markov  process.  For  the  case  where  one  intronic  region  is  a  zero-order 
Markov  process  and  one  intronic  region  is  a  first  order  Markov  process,  it  might  be 
appropriate  to  not  compare  the  two  sequences  without  knowledge  of  the  underlying 
distribution.  A  methodology  that  incorporates  the  optimal  hypothesis  testing  approach 
and  provides  insight  into  an  underlying  distribution  for  the  test  statistics  would  be 
advantageous.  This  methodology  should  result  in  the  same  degrees  of  freedom  for  all 
hypothesis  test  results  so  that  the  results  can  be  used  as  a  distance  measure.  In  general, 
more  methods  that  incorporate  ordered  data  should  be  included  into  the  clustering 
methodology  and  more  statistical  rigor  should  brought  into  the  process. 

Overall  Conclusion 

In  conclusion,  I  have  provided  a  methodology  that  allows  the  first  systematic 
comparison  of  information  within  and  between  genes  of  various  regulatory  groupings. 
This  offers  an  alternative  to  the  current  methodologies.  The  greatest  strength  of  the 
methodology  is  the  ability  to  apply  it  to  the  clustering  of  intronic  regions,  regions  that 
have  previously  been  extremely  difficult  to  analyze  and  compare. 
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APPENDIX  A 


MATERIAL  TO  SUPPLEMENT  CHAPTER  IV 


Figure  A.l:  Graph  of  Intronic  Lengths  for  Oncogenes  Truncated  at  16,000  bp 


Figure  A.2:  Graph  of  Intronic  Lengths  for  Tumor  Suppressor  Genes  Truncated  at  16,000  bp 
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gure  A.3:  Graph  of  Intronic  Lengths  for  Non-Regulatory  Genes  Truncated  at  16,000  bp 
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Figure  A.4:  Graph  of  Intronic  Lengths  for  All  Genes  in  Data  Set  Truncated  at  16,000  bp 
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Figure  A.5:  Graph  of  Intronic  Log  Lengths  for  Oncogenes 


Figure  A.6:  Graph  of  Intronic  Log  Lengths  for  Tumor  Suppressor  Genes 


Figure  A.7:  Graph  of  Intronic  Log  Lengths  for  Non-Regulatory  Genes 
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Figure  A.8:  Graph  of  Intronic  Log  Lengths  for  All  Genes  in  Data  Set 


Table  A.1:  Means  and  Standard  Deviations  of  Intronic  Lengths  and  Intronic  Log  Lengths 
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APPENDIX  B 


MATERIAL  TO  SUPPLEMENT  CHAPTER  V 


Table  B.l:  Markov  Results  Superimposed  over  Lengths  of  DNA  Sequence  Data  Available  For  Introns  in  Each  Gene 
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Table  B.l  (Continued):  Markov  Results  Superimposed  over  Lengths  of  DNA  Sequence  Data  Available  for  Introns 
in  Each  Gene _  _ 
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Table  B.l  (Continued):  Markov  Results  Superimposed  over  Lengths  of  DNA  Sequence  Data  Available  For  Introns 
in  Each  Gene 


102 


(HNPlDtHl 


APPENDIX  C 


MATERIAL  TO  SUPPLEMENT  CHAPTER  VI 


105 


Table  C.3:  FMS(O)  Gene  (21  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a=  0.01  Level 
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Table  C.4:  BCL-3(0)  Gene  (8  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  =  0.01  Level) 
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Table  C.9:  MYC(O)  Gene  (2  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  =  0.01  Level) 
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Table  C.ll:  N-MYC(O)  Gene  (2  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  =  0.01  Level) 
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HKII(NR) _ 17 _ 2  20.88  TiiTl  14.181  9.48 


Table  C.13:  PIMl(O)  Gene  (5  Introns)  Versus  Contrast  Genes  (Gray  Cells  RejectHoat^a^^O^l  Level) 
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Table  C.14  (Continued):  KIT(O)  Gene  (20  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a=  0.01  Level ) 
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Table  C.16;  FPS(O)  Gene  (18  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a—  0.01  Levelj__ _ 

Contrast  U  Introns  in  #  Introns 
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Table  C.17:  WNT-l(O)  Gene  (3  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  =  0.01  Level) 
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Table  C.18:  K-RAS2(0)  Gene  (4  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  —  0.01  Level) 
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Table  C.19;  MLHl(TS)  Gene  (18  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a=  0.01  Level )  _ 
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Table  C.19  (Continued):  MLH1(TS)  Gene  (18  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a==  0.01  Level ) 
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Table  C.20:  MSH2(TS)  Gene  (15  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a=  0.01  Level )  _ 

Contrast  U  Introns  in  #  Introns 
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Table  6.20  (Continued):  MSH2(TS)  Gene  (15  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a=  0.01  Level ) 

Contrast 

Gene  Intron  12  Intron  13  Intron  14  Intron  15  Mean  SD 


!?  S  S  2  !g  2  $  SR  S 


£  ^  ^  91  S  SR  2  ^  2  $  ^  9  2  $  ^  i  S  ^  ^  S  2 


^  a.  fn 

S  s 
%  2 


2  a  s  «  ^  h  ^  =  h  -’  h  s  2  2  h  K  2  2  h5  a  2  «  «  - 


^  '^y-N  00  O^ 

o§£i'o|l||??||o5o2  |||^||||ll 

<abnbsS^SJiz!SSSKb^i^SStfzUb<(;upu 


TPI(NR)  6  3  56.76  13.31  15.00  28.35  24.61 

GCK(NR)  9 _ 3  50.61  15.79  5(^11  41.08  22.13 

GAPDH(NR) _ 8 _ 3 _ 4449  28.99  12.93  28.87  15.88 

SDHB(NR) _ 7 _ 3  ,  29.15  24.23  38.29  20.24 

HK1I(^) _ 17 _ 3  .3i.9i  25.57  13.24|  24.60|  10.90 


Tqble  C.22;  RB(TS)  Gene  (26  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  =  0.01  Level) 
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Table  C.23  (Continued):  NFl(TS)  Gene  (56  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  =  0.01  Level) 
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Table  C>24;  G6PD(NR)  Gene  (12  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  =  0.01  Level)) 
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Table  C.25:  PGK(NR)  Gene  (10  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  =  0.01  Level)) 
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GCK(NR) - 9  9  S4.S4  50.87  58.22  47.88  43.62  38.08  2933  83.a3p42i4^^^M  49.86  15.24 
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Table  C.26:  ADOL(NR)  Gene  (8  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  —  0.01  Level) 
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Table  C.27:  GTP(NR)  Gene  (8  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  —  0.01  Level) 

Contrast  U  Introns  in  #  Introns 

Gene  Contrast  Gene  Compared  Intron  1  Intron  2  Intron  3  Intron  4  Intron  5  Intron  6  Intron  7  Intron  8  Mean 
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GCK(NR)  '  9  '  8  3O6  38.<i0  32.82  23.01  .54.94  10.15  >Hj^33  TOM  36.14  18.72 

GAPDH(NR)  8  8  64.4|  11.61  18.58  23.29  8.54  24.43^^18^  28.63  21.30 


Table  C.28;  PFK(NR)  Gene  (21  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a=  0.01  Level )  _ 

Contrast  #  Introns  in  #  Introns 

Gene  Contrast  Gene  Compared  Intron  1  Intron  2  Intron  3  Intron  4  Intron  5  Intron  6  Intron  7 


Table  C.28  (Continued):  PFK(NR)  Gene  (21  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a=  0.01  Level ) 
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Table  C.30;  GCK(NR)  Gene  (9  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  =  0.01  Level)) 
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TPI(NR)  ~  6 _ 6  84.83  10.64  38^8  16.95  M3|  32.27  27.69 

GAPDH(NR)  _ 8 _ 8 _  121,12  18.92  9.17  20.25  17.07  9.22  1437[2&^^^^^^M  29.63  37.43 

SDHB(NR)  7  ~  7  69;67  48.46  53.19  40.53  30.98  28.66  27.66  42.74  15.47 

HKII(NR)  _ 17 _ 9  ^^^^  2^^  18.59  36.SS  30.16  41^71)  19.04  16.52|;  36.SSI  18.62|  27.13|  9.50 


Table  C.31:  GAPDH(NR)  Gene  (8  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  -  0.01  Level) 

Contrast  #  Introns  in  #  Introns  I 

Gene  Contrast  Gene  Compared  Intron  1  Intron  2  Intron  3  Intron  4  Intron  5  Intron  6  Intron  7  Intron  8  j  Mean  S 


TPI(NR)  ~  6 _ 6  26.65  12.89  20.77  14.87 _ li  16.61  6.15 

GCK(NR)  ~  9 _ 8 _ 121.12  18.92  9.17  20.25  17.07  9.22  14J7r^26^  29.63  37.43 

SDHB(NR)  7  7 _ 66.06  5^^  16.42  23.24  9.96  18.99  23.26|^B||^B  30.36  21.22 

HKII(NR)  17  '  8  33.41  20.77  11.09  21.55  15.53  _^7j7  ^^6;24j___23;74)  20.03 1  6.70 


Table  6.32;  SDHB(NR)  Gene  (7  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a  -  0.01  Level) 

Contrast  #  Introns  in  #  Introns 

Gene  Contrast  Gene  Compared  Intron  1  Intron  2  Intron  3  Intron  4  Intron  5  intron  6  Intron  7  Mean  SD 
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TPI(NR)  6  6  60.86  20.52  15.4  21.83  12.66  20l8^|^B  25.29  17.78 

GCK(NR)  ~  9 _ 7 _ 69.67  48.46  53.19  40.53  30.98  28.66  2746  42.74  15.47 

GAPDH(NR) _ 8 _ 7  66.06  54.59  16.42  23.24  9.96  18.99  23.26  ..  30.36  21.22 

HKII(NR)  ~~  17  7  ~  26.tB  18.93  21.33|  13.61 1  26.431  29.671  23.84|  22.96|  5.47 


Table  C.33;  HKII(O)  Gene  (17  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a=  0-01  Level )  _ 

Contrast  #  Introns  in  #  Introns  I 

Gene  Contrast  Gene  Compared  Intron  I  Intron  2  Intron  3  Intron  4  Intron  5  Intron  6 1  Intron  7  Intron  8 1  Intron  9  Intron  I0|  Intron  II 


TPI(NR)  6  6  2S.I7  12.90  18.05  15.76  21.69 

GCK(NR)  9  ~  9  '  26.48  18.59  36.55  30.16  4i;70  19.04  16.52  T  36^1  18.62 

GAPDH(NR)  8  8  33.41  20.77  11.09  21.55  15.53  17.87  16.24  ^2£74B^MM 

SDHB(NR)  '  7 _ 7  26.«8  18.93|  21.33|  13.6l|  UMl  2M7J 


Table  C.33  (Continued):  HKII(O)  Gene  (17  Introns)  Versus  Contrast  Genes  (Gray  Cells  Reject  Ho  at  a-  0.01  Level ) 

Contrast 

Gene  Intron  12  Intron  13  Intron  14  Intron  15  Intron  16  Intron  17  Mean  SD 
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APPENDIX  D 


SAS  Code 


148 


Macro  to  Analyze  Number  of  Introns,  Means  of  Intronic  Log  Lengths  and 
Standard  Deviations  of  Intronic  Log  Lengths  in  a  Gene 


title  'Intron  Attributes  in  Contrast  Genes'; 
data  genepool; 


input  gene 

&  $12. 

intnum  intmn 

intsd  Inintmn 

.  lnintsd@@; 

cards ; 

ABL(O) 

10 

22724.3 

62361.11 

7.957296 

1.837927 

BCR(O) 

22 

5943.0 

15089.64 

7.530286 

1.304262 

FMS (0) 

21 

2622.095 

5632.928 

6.676911 

1.572582 

BCL-3 (0) 

8 

1218.625 

1676.15 

6,408406 

1.237932 

FOS (0) 

3 

432.6667 

319.5033 

5.8088 

0.969894 

HST-1 (0) 

2 

577.5 

55.86144 

6.3564 

0.096874 

INT-2 (0) 

2 

3968.5 

2375.172 

8.1875 

0.638659 

LCK(O) 

11 

986.0909 

1867.0 

5.710842 

1.401179 

MYC(O) 

2 

1500.0 

175.3625 

7.30975 

0.117168 

L-MYC(O) 

2 

1667.5 

1843.427 

6.94695 

1.484571 

N-MYC (0) 

2 

1765.0 

1231.78 

7.33635 

0.764595 

MAX(O) 

1 

483.0 

. 

6.18 

. 

PIMl (0) 

5 

514.4 

622.4386 

5.565122 

1.311433 

KIT(O) 

20 

1574,167 

1408.351 

6.788102 

1.296034 

RAFAl (0) 

15 

554.0 

583.6962 

5.69871 

1.187702 

FPS (0) 

18 

473.5556 

531.8399 

5.680134 

0.973222 

WNT-1 (0) 

3 

625.6667 

•  141.8462 

6.419653 

0.246156 

K-RAS2 (0) 

4 

8979.75 

6414.456 

8.796863 

1.021532 

MLHl (TS) 

18 

3069.667 

2774.064 

7.654244 

1.020609 

MSH2 (TS) 

15 

4573.333 

4755.218 

8.053468 

0.836118 

MTSl (TS) 

3 

536.6667 

266.5827 

6.170747 

0.632422 

RB(TS) 

26 

6644.962 

14556.52 

7.754034 

1.459014 

NFl (TS) 

56 

4285.979 

13155.08 

6.99366 

1.412317 

G6PD(NR) 

12 

1102.5  ' 

2765.385 

5.749208 

1.33521 

PGK(NR) 

10 

2092.1 

2110.976 

6.998464 

1.293345 

ADOL{NR) 

8 

1581.75 

1516.879 

7.05172 

0.798806 

GTP(NR) 

8 

315.0 

148.8057 

5.607898 

0.634424 

PFK{NR) 

21 

1182.857 

1915.214 

6.406921 

1.139428 

TPI (NR) 

6 

341.1667 

413.5773 

5.370938 

0.985794 

GCK(NR) 

9 

2534.667 

2694.33 

7.486829 

0.8163 

GAPDH (NR) 

8 

321.5 

533.1883 

5.145963 

0.982632 

SDHB(NR) 

.7 

4950.0 

3979.426 

8.159084 

0.981201 

HKII (NR) 

17 

2360.059 

3918.626 

6.737161 

1.504913 

title  ’Cluster  Analysis  Based  on  Number  of  Introns,  Mean  and  SD  of  Log 
Lengths  for  33  Contest  Genes’; 

%macro  analyze (method, ncl) ; 

proc  cluster  data=genepool  out=tree  method=&method  std  ccc  pseudo; 
var  intnum  Inintmn  Inintsd; 
id  gene; 
title2; 

%let  k=l; 

%let  n=%scan (&ncl, &k) ; 
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%do  %while ( &n^=) ; 

proc  tree  data=tree  noprint  out=out  ncl=&n; 
copy  intnum  intmn; 
proc  plot; 

plot  intnum*intmn=cluster  /  hpos=86  vpos=26; 
title2  "Plot  of  &n  Clusters  from  METHOD=&METHOD" 
run; 

%let  k-%eval{&k+l) ; 

%let  n=%scan ( &ncl, &k) ; 

%end; 

%mend; 

%analyze (average,  ) 

%analyze (centroid, ) 

%analyze (complete, ) 
lanalyze (eml, ) 

%analyze (flexible, ) 

%analyze (mcquitty, ) 

%analyze (median, ) 

%analyze ( single, ) 

%analyze (ward, ) 


SAS  Code  to  Perform  Cluster  Analysis  Based  on  Mean  Chi-square  Homology 
Statistics  After  Comparing  Introns  Between  Genes 


options  center  ls=80; 

data  homology (type=distance)  ; 

infile  *  c: \msof f ice\excel\frdecl . txt ’  delimiter^*  @  *  missover  lrecl=330; 
input  ab'cdefghij  klmnopqrstuvwxyzaaabacadae 
af  ag  genname  $  ; 
run; 

proc  cluster  data=homology  method=average  pseudo  nosquare  ; 

id  genname; 

run; 

proc  cluster  data=homology  method=centroid  pseudo  nosquare  ; 

id  genname; 

run; 

proc  cluster  data=homology  method=single  nosquare  ; 

id  genname; 

run; 

proc  cluster  data=homology  method=complete  nosquare  ; 

id  genname; 

run; 

proc  cluster  data=homology  method=flexible  nosquare  ; 

id  genname; 

run; 

proc  cluster  data-homology  method=ward  pseudo  nosquare  ; 

id  genname; 

run; 

proc  cluster  data=homology  method=Mcquitty  nosquare  ; 

id  genname; 

run; 

proc  cluster  data=homology  method=median  pseudo  nosquare  ; 

id  genname; 

run; 
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APPENDIX  E 


S-PLUS  CODE 


S-Plus  Code  to  Create  a  Markov  Chain  Once  the  DNA  Sequence  Has  Been  Read 
Into  ‘xtest’  from  a  Text  File  (Code  is  for  a  Chain  50  bp  long) 

>x0_xtest[-50] 

>xl_xtest[-l] 

rbind[x0,xlj 

/*The  above  code  is  more  efficient  than  that  traditional  long  way  (an  iterative  loop).  The 
only  problem  is  that  there  is  a  limit  on  how  much  code  the  rbind  procedure  can  handle.  A 
DNA  seqence  longer  than  10,000  bp  must  be  broken  apart  in  order  to  use  the  above 
procedure.  In  this  case,  the  user  may  prefer  to  create  the  chain  using  the  iterative  loop 
(even  though  it  will  be  fairly  slow).  The  code  for  the  long  way  is  also  provided*/ 

>M_matrix(0,4,4) 

>for  (i  in  1 :49)  { 

+a_xtest[i] 

+b_xtest[i+l] 

+M[a,bLM[a,b]+l} 
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S-Plus  Function  to  Test  for  Independence  Versus  a  Markov  Process 


>  mark.test 
function(x) 

{ 

zij  <«•  matrix(0,  ncol  =  4,  nrow  =  4) 
for(i  in  1 :4)  { 

for(j  in  1 :4)  { 

<-(((x[i,j])-(( 

suni(x[i,  ]))  *  ((sum(x[ 
,j]))/(sum(x)))))^2)/(( 
sum(x[i,  ]))  *  ((sum(x[ 
,j]))/(sum(x)))) 
siim(zij) 


} 


} 


154 


S-Plus  Function  to  Compute  Chi-square  Test  Statistics  from  a  Test  for  Homology 
Between  Intron  x  and  Intron  y 


>  homl.test 
fxinction(x,  y) 

{ 

zij  <-  matrix(0,  ncol  =  4,  nrow  =  4) 
for(i  in  1 :4)  { 

for(j  in  1 :4)  { 

<-  (((sum(x[i,  ]))  *  (sum(y[i,  ])))/ 
((x[i,j])  +  (y[ij])))*((((x[i,j])/ 
(sum(x[i,  ])))-((y[i,j])/(sum(y[i, 

])))r2) 

} 

} 

sum(zij) 
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